We show that non-Euclidean gradient descent also operates at the edge of stability, and that the optimizer's geometry (i.e., the choice of norm) determines the sharpness at convergence [20], providing a unified explanation of edge-of-stability behavior across geometries.
We provide a theoretical characterization of neural network loss landscapes based on the relationship between sharpness and loss. We show both theoretically and empirically that this characterization is tight near initialization [18]. Leveraging this insight, we justify learning-rate warmup, a widely used technique, as a means of achieving stable convergence.
We propose a novel class of functions that characterizes the loss landscape of modern deep models without relying on heavy overparameterization and that naturally accommodates saddle points [11]. Under this framework, we prove convergence guarantees for gradient-based methods and support the framework with both theoretical analysis and empirical studies across a diverse set of architectures.
These results advance the theoretical understanding of neural network optimization by providing principled characterizations of loss landscapes and their interaction with optimization algorithms. Together, they bridge theory and practice by linking landscape structure, optimizer design, and training stability.
43rd International Conference on Machine Learning (Spotlight, ICML 2026)
43rd International Conference on Machine Learning (ICML 2026)
43rd International Conference on Machine Learning (ICML 2026)
38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)