
Loss Landscape of Neural Networks

Main Challenges

  • Edge of Stability: Neural networks trained with gradient-based methods consistently operate at the edge of stability, a regime in which spectral characteristics of the loss, such as the sharpness (the largest eigenvalue of the Hessian), oscillate around a threshold determined by the step size. Classical optimization theory does not explain this behavior.
  • Loss landscape geometry: Despite the highly nonconvex and high-dimensional nature of neural network loss landscapes, gradient-based methods perform remarkably well in practice. Developing a theoretical characterization of these landscapes remains an important challenge. Establishing connections between landscape structure and algorithm design is key to achieving efficient training.
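
To make the stability threshold concrete, here is a minimal sketch on a one-dimensional quadratic, where the curvature plays the role of the sharpness. The function name `gd_on_quadratic` and all constants are illustrative assumptions, not taken from the papers below. On this toy problem gradient descent contracts when the step size is below 2/curvature and diverges above it.

```python
def gd_on_quadratic(curvature, step_size, x0=1.0, num_steps=50):
    """Run GD on f(x) = 0.5 * curvature * x**2 and return the final |x|."""
    x = x0
    for _ in range(num_steps):
        grad = curvature * x          # f'(x) = curvature * x
        x = x - step_size * grad      # x <- (1 - step_size * curvature) * x
    return abs(x)

curvature = 10.0                      # stands in for the sharpness
threshold = 2.0 / curvature           # classical stability limit on the step size

# Slightly below the threshold the iterates shrink; slightly above they blow up.
stable = gd_on_quadratic(curvature, 0.9 * threshold)
unstable = gd_on_quadratic(curvature, 1.1 * threshold)
```

Neural network losses are not quadratic, so this only illustrates where the 2/step-size threshold comes from, not the edge-of-stability dynamics themselves.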

My Contributions

Non-Euclidean Gradient Descent at the Edge of Stability

We show that non-Euclidean gradient descent also operates at the edge of stability, and that the optimizer's geometry (i.e., the choice of norm) determines the sharpness at convergence [20], providing a unified explanation of edge-of-stability behavior across geometries.
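
As a rough illustration of how the choice of norm changes the update, the sketch below computes the unit-norm steepest-descent direction under three geometries. The helper `steepest_direction` is hypothetical, and the step scaling used in [20] is not specified here, so only the direction is shown.

```python
import numpy as np

def steepest_direction(grad, geometry):
    """Return the unit-norm direction d that minimizes <grad, d> under `geometry`.

    Assumes `grad` is a nonzero 2-D array (illustrative helper, not from [20]).
    """
    if geometry == "l2":
        # Euclidean (Frobenius) norm: normalized negative gradient.
        return -grad / np.linalg.norm(grad)
    if geometry == "linf":
        # Entrywise max norm: sign descent.
        return -np.sign(grad)
    if geometry == "spectral":
        # Spectral norm: set every singular value to 1 (orthogonal polar factor).
        u, _, vt = np.linalg.svd(grad, full_matrices=False)
        return -(u @ vt)
    raise ValueError(f"unknown geometry: {geometry}")

grad = np.array([[3.0, 0.0], [0.0, -4.0]])
l2_dir = steepest_direction(grad, "l2")
linf_dir = steepest_direction(grad, "linf")
spectral_dir = steepest_direction(grad, "spectral")
```

The same gradient yields different directions under each norm, which is the sense in which the geometry belongs to the optimizer rather than to the loss.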

Theoretical Foundation for Learning-Rate Warmup

We provide a theoretical characterization of neural network loss landscapes based on the relationship between sharpness and loss. We show both theoretically and empirically that this characterization is tight near initialization [18]. Leveraging this insight, we justify the use of learning-rate warmup—a widely used technique—for achieving stable convergence.
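
One way to read "the relationship between sharpness and loss" is as a curvature bound that grows with the loss sub-optimality, as the abstract of [18] describes. The formulas below are a sketch: the first is the standard $(L_0, L_1)$-smoothness condition, and the second is an assumed loss-based form matching that description, whose exact statement and constants are not confirmed here.

```latex
% Standard (L_0, L_1)-smoothness: curvature bounded by the gradient norm.
\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|

% Assumed loss-based variant: curvature bounded by the loss sub-optimality,
% where f^\star denotes the minimal loss value.
\|\nabla^2 f(x)\| \le L_0 + L_1 \bigl(f(x) - f^\star\bigr)
```

Heuristically, under the second bound the permissible curvature is largest when the loss is high, as it is near initialization, and shrinks as training progresses, which is consistent with starting from a small step size and increasing it.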

Loss Landscape Characterization without Over-Parametrization

We propose a novel class of functions that characterizes the loss landscape of modern deep models without relying on heavy overparameterization and naturally accommodates saddle points [11]. Under this framework, we prove that gradient-based methods enjoy convergence guarantees and validate the approach through both theoretical analysis and empirical studies across a diverse set of architectures.

Research Impact

These results advance the theoretical understanding of neural network optimization by providing principled characterizations of loss landscapes and their interaction with optimization algorithms. Together, they bridge theory and practice by linking landscape structure, optimizer design, and training stability.

Selected Publications

2026
[20] Non-Euclidean Gradient Descent Operates at the Edge of Stability.

43rd International Conference on Machine Learning (Spotlight, ICML 2026)

The Edge of Stability (EoS) is a phenomenon in which the sharpness (the largest eigenvalue of the Hessian) converges to $2/\eta$ during training with gradient descent (GD) with step-size $\eta$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of directional smoothness (Mishkin et al., 2024). This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/\eta$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.
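
For the Euclidean case, the sharpness can be estimated without forming the Hessian by running power iteration on Hessian-vector products. The sketch below is an assumption-laden illustration: the names `hessian_vector_product` and `estimate_sharpness` are hypothetical, the Hessian-vector product uses central finite differences of a user-supplied gradient function, and the generalized (non-Euclidean) sharpness from the paper is not implemented.

```python
import numpy as np

def hessian_vector_product(grad_fn, w, v, eps=1e-4):
    """Approximate H(w) @ v by central differences of the gradient."""
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2.0 * eps)

def estimate_sharpness(grad_fn, w, num_iters=200, seed=0):
    """Power iteration for the Hessian eigenvalue of largest magnitude."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(num_iters):
        hv = hessian_vector_product(grad_fn, w, v)
        eigenvalue = float(v @ hv)        # Rayleigh quotient for the unit vector v
        v = hv / np.linalg.norm(hv)
    return eigenvalue

# Toy check on f(w) = 0.5 * w^T A w, whose Hessian is A (top eigenvalue 20).
A = np.diag([1.0, 5.0, 20.0])
grad_fn = lambda w: A @ w
sharpness = estimate_sharpness(grad_fn, np.ones(3))

step_size = 0.1
eos_threshold = 2.0 / step_size           # the 2/eta threshold from the abstract
```

Power iteration returns the eigenvalue of largest magnitude, so this matches the sharpness only when the top eigenvalue is positive and dominant, which is assumed here.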

2025
[18] Why Do We Need Warm-up? A Theoretical Perspective.

43rd International Conference on Machine Learning (ICML 2026)

Learning rate warm-up (increasing the learning rate at the beginning of training) has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the $(L_0, L_1)$-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
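
As a concrete picture of what a warm-up schedule looks like, here is a minimal sketch of linear warm-up followed by a constant step size. The function `warmup_step_size` and its parameters are illustrative assumptions. Linear warm-up is one common choice and is not necessarily the schedule analyzed in the paper.

```python
def warmup_step_size(step, peak_step_size, warmup_steps):
    """Linearly increase the step size to its peak, then hold it constant.

    `step` is the zero-based iteration index.
    """
    if warmup_steps <= 0:
        return peak_step_size                      # no warm-up requested
    fraction = min(1.0, (step + 1) / warmup_steps)
    return peak_step_size * fraction

# Step sizes for the first iterations of a run with a 10-step warm-up.
schedule = [warmup_step_size(t, peak_step_size=0.1, warmup_steps=10) for t in range(12)]
```

The schedule starts at one tenth of the peak and reaches the peak at the tenth iteration, after which it stays flat.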

[16] On the Interaction of Noise, Compression Role, and Adaptivity under (L0,L1)-Smoothness: An SDE-based Approach.

43rd International Conference on Machine Learning (ICML 2026)

Using stochastic differential equation (SDE) approximations, we study the dynamics of Distributed SGD, Distributed Compressed SGD, and Distributed SignSGD under $(L_0, L_1)$-smoothness and flexible noise assumptions. Our analysis provides insights, which we validate through simulation, into the intricate interactions between batch noise, stochastic gradient compression, and adaptivity in this modern theoretical setup. For instance, we show that adaptive methods such as Distributed SignSGD can successfully converge under standard assumptions on the learning rate scheduler, even under heavy-tailed noise. In contrast, Distributed (Compressed) SGD with a pre-scheduled decaying learning rate fails to converge unless the schedule also accounts for an inverse dependency on the gradient norm, de facto falling back into an adaptive method.
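
The contrast between the two kinds of update can be sketched in a few lines. The code below assumes that Distributed SGD averages the workers' stochastic gradients and that Distributed SignSGD averages their entrywise signs. Both function names and the aggregation rule for SignSGD are assumptions made for illustration, and compression is omitted.

```python
import numpy as np

def distributed_sgd_step(x, worker_grads, lr):
    """Average the workers' stochastic gradients, then take a GD step."""
    return x - lr * np.mean(worker_grads, axis=0)

def distributed_signsgd_step(x, worker_grads, lr):
    """Average the entrywise signs of the workers' gradients, then step."""
    return x - lr * np.mean(np.sign(worker_grads), axis=0)

# Three workers, two parameters; the second worker has one extreme outlier,
# standing in for a heavy-tailed noise sample.
worker_grads = np.array([[1.0, -2.0],
                         [1e6, -0.5],
                         [0.5,  3.0]])
x = np.zeros(2)
lr = 0.1

sgd_x = distributed_sgd_step(x, worker_grads, lr)
sign_x = distributed_signsgd_step(x, worker_grads, lr)
```

A single outlier moves the SGD iterate by tens of thousands, whereas every coordinate of the sign-based step is bounded by the learning rate regardless of the gradient magnitudes.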

2024
[11] Loss Landscape Characterization of Neural Networks without Over-Parametrization.

38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)

Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are even more remarkable given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Łojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. In order to address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also include saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.
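
For reference, the Polyak-Łojasiewicz inequality mentioned in the abstract can be written as follows, with $\mu > 0$ a constant and $f^\star$ the minimal value of $f$ (this notation is introduced here for illustration). The new function class proposed in [11] is not reproduced.

```latex
% Polyak-Lojasiewicz (PL) inequality, required to hold for all x:
\frac{1}{2}\,\|\nabla f(x)\|^2 \;\ge\; \mu\,\bigl(f(x) - f^\star\bigr)
```

Wherever the gradient vanishes, the left-hand side is zero, so the inequality forces $f(x) = f^\star$. Every stationary point of a PL function is therefore a global minimizer, which rules out saddle points and explains why a condition that can accommodate them is a meaningful relaxation.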