We prove that non-Euclidean gradient descent, like its Euclidean counterpart, operates at the edge of stability (EoS), and that the geometry of the Bregman divergence determines the sharpness at convergence, providing the first unified explanation of EoS across geometries.
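For orientation, the Euclidean version of this phenomenon is that gradient descent with step size η drives the largest Hessian eigenvalue of the loss to roughly 2/η. A natural way to write the non-Euclidean analogue, sketched here for mirror descent with reference function h (so the Bregman divergence is D_h) and not necessarily in the exact form proved in the paper, is that the sharpness measured in the geometry of ∇²h equilibrates at the same threshold:

    \lambda_{\max}\big(\nabla^2 L(\theta_t)\big) \;\approx\; \frac{2}{\eta}
    \qquad \text{(Euclidean gradient descent)},

    \lambda_{\max}\Big(\nabla^2 h(\theta_t)^{-1/2}\, \nabla^2 L(\theta_t)\, \nabla^2 h(\theta_t)^{-1/2}\Big) \;\approx\; \frac{2}{\eta}
    \qquad \text{(mirror descent with respect to } h\text{, illustrative)}.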
We provide the first rigorous proof that warm-up reduces the sharpness of the iterates early in training, preventing the loss spikes observed in cold-start training of transformers.
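As a toy illustration of the mechanism rather than the paper's construction, the sketch below minimises L(θ) = θ⁴/4 in one dimension, where the sharpness L''(θ) = 3θ² is largest at initialisation and falls as the iterates shrink; the function names and constants are ours. A cold start at the full learning rate violates the heuristic stability condition lr · sharpness < 2 and the iterates blow up, while a linear warm-up keeps the product below 2 until the sharpness has dropped.

    # Toy sketch (not the paper's setting): gradient descent on L(theta) = theta**4 / 4,
    # whose sharpness L''(theta) = 3 * theta**2 decreases as the iterates shrink.
    def run(base_lr=0.6, warmup_steps=1, steps=100, theta0=2.0):
        """warmup_steps=1 is a cold start; larger values ramp the lr linearly."""
        theta = theta0
        for step in range(steps):
            lr = base_lr * min(1.0, (step + 1) / warmup_steps)
            theta = theta - lr * theta**3              # gradient step on theta**4 / 4
            if abs(theta) > 1e6:                       # loss spike: the step diverged
                return "diverged"
        return f"theta = {theta:.4f}, sharpness = {3 * theta**2:.4f}"

    print("cold start:", run(warmup_steps=1))          # lr * sharpness > 2 at theta0
    print("warm-up:   ", run(warmup_steps=50))         # stable; ends in a flat region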
We introduce a stochastic differential equation (SDE) model for gradient descent on neural-network loss surfaces that captures both the local curvature and global roughness of the landscape, enabling sharp convergence bounds that match empirical training curves.
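For reference, the standard SDE approximation of stochastic gradient descent with learning rate η and gradient-noise covariance Σ, the baseline that models of this kind typically refine, reads

    d\theta_t = -\nabla L(\theta_t)\, dt + \sqrt{\eta}\, \Sigma(\theta_t)^{1/2}\, dW_t,

where W_t is a standard Wiener process; the additional terms that encode local curvature and global roughness are specific to the paper and are not reproduced in this sketch.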
We characterise the loss-landscape regions where adaptive gradient methods outperform SGD, identifying the curvature conditions under which adaptivity provides provable acceleration.
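The simplest setting that exhibits such a separation, shown here purely as an illustration and not as the regime characterised in the paper, is an axis-aligned ill-conditioned quadratic: plain gradient descent must use a step size throttled by the largest eigenvalue, while a diagonally preconditioned (RMSProp/Adam-style) update rescales each coordinate.

    # Illustrative toy, not the paper's characterisation: GD vs. a diagonally
    # preconditioned update on an anisotropic quadratic 0.5 * theta^T H theta.
    import numpy as np

    H = np.diag([1000.0, 1.0])                    # condition number 1000

    def grad(theta):
        return H @ theta                          # gradient of the quadratic loss

    theta_gd = np.array([1.0, 1.0])
    theta_ad = theta_gd.copy()
    v = np.zeros(2)                               # running second-moment estimate
    lr_gd = 1.0 / 1000.0                          # GD step bounded by lambda_max
    lr_ad, beta, eps = 0.1, 0.99, 1e-8

    for _ in range(500):
        theta_gd = theta_gd - lr_gd * grad(theta_gd)
        g = grad(theta_ad)
        v = beta * v + (1 - beta) * g ** 2
        theta_ad = theta_ad - lr_ad * g / (np.sqrt(v) + eps)  # per-coordinate rescaling

    print("GD:      ", np.linalg.norm(theta_gd))  # still far along the flat direction
    print("adaptive:", np.linalg.norm(theta_ad))  # close to the optimum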
Our work on the Edge of Stability in non-Euclidean settings has opened a new direction connecting optimisation geometry to generalisation, with direct implications for the design of geometry-aware training algorithms for transformers. The warm-up theory has been used to derive principled warm-up schedules for large-scale language model pre-training, replacing hand-tuned heuristics.
43rd International Conference on Machine Learning (Spotlight, ICML 2026)
43rd International Conference on Machine Learning (ICML 2026)
43rd International Conference on Machine Learning (ICML 2026)
38th Annual Conference on Neural Information Processing Systems (Poster, NeurIPS 2024)