Loss Landscape of Neural Networks

Main Challenges

  • Edge of Stability: Neural networks trained with gradient descent consistently operate at the edge of linear stability, yet classical optimisation theory has no explanation for this behaviour.
  • Loss landscape geometry: The landscape is highly non-convex and high-dimensional; characterising its critical points and flat regions remains an open theoretical problem.
  • Warm-up schedules: Learning-rate warm-up is used in virtually all large-model training pipelines, but there is no theoretical justification for why it helps or how to set its duration.
  • Non-Euclidean dynamics: Sharpness-aware and mirror-descent type methods implicitly change the geometry of the loss landscape, and their interaction with the edge-of-stability phenomenon is not understood.
  • Implicit regularisation: Gradient descent finds flat minima that generalise well, but the mechanism linking landscape geometry to generalisation is unclear.

My Contributions

Non-Euclidean Gradient Descent at the Edge of Stability

We show that non-Euclidean gradient descent also operates at the edge of stability, and that the choice of norm determines the appropriate generalised sharpness measure, providing a unified, geometry-aware account of EoS across optimisers.
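For intuition about what "changing the geometry" means in practice, the sketch below contrasts a plain gradient step with steepest descent under the $\ell_{\infty}$ norm (sign descent); this is a generic illustration, not the algorithm analysed in the paper.

```python
import numpy as np

def gd_step(w, grad, lr):
    # Steepest descent w.r.t. the Euclidean (l2) norm: plain gradient descent.
    return w - lr * grad

def linf_descent_step(w, grad, lr):
    # Steepest descent w.r.t. the l_inf norm: the step direction is sign(grad),
    # scaled by the dual (l1) norm of the gradient.
    return w - lr * np.linalg.norm(grad, ord=1) * np.sign(grad)
```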

Theoretical Foundation for Learning-Rate Warm-Up

We give the first rigorous proof that warm-up reduces the sharpness of the iterates early in training, helping to avoid the loss spikes observed when transformers are trained from a cold start.
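As a concrete point of reference, a linear warm-up schedule of the kind this analysis targets can be written as below; this is a generic sketch, not the schedule derived in the paper.

```python
def warmup_lr(step, base_lr, warmup_steps, init_lr=0.0):
    """Linearly ramp the learning rate from init_lr to base_lr over warmup_steps,
    then hold it at base_lr (a generic linear warm-up schedule)."""
    if step < warmup_steps:
        return init_lr + (base_lr - init_lr) * (step + 1) / warmup_steps
    return base_lr
```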

L0/L1 SDE Analysis and Landscape Characterisation

We develop SDE models of stochastic gradient methods under (L0, L1)-smoothness and propose a function class that characterises the loss landscape of deep networks without extensive over-parametrisation, yielding convergence guarantees that we validate against empirical training behaviour.

Alpha-Beta Analysis of Adaptive Methods

We characterise the loss-landscape regions where adaptive gradient methods outperform SGD, identifying the curvature conditions under which adaptivity provides provable acceleration.
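As a generic toy illustration of a curvature regime where adaptivity helps (not the alpha-beta analysis itself), the sketch below compares plain gradient descent with an Adagrad-style diagonal method on an ill-conditioned quadratic, where GD's step size is capped by the sharpest direction.

```python
import numpy as np

# Ill-conditioned quadratic f(w) = 0.5 * sum_i H[i] * w[i]**2 with H[0] >> H[1].
H = np.array([100.0, 0.01])
w_gd, w_ada = np.array([1.0, 1.0]), np.array([1.0, 1.0])
acc = np.zeros(2)                      # Adagrad accumulator of squared gradients
lr_gd, lr_ada = 1.0 / H.max(), 0.5     # GD step size limited by the largest curvature

for _ in range(200):
    w_gd = w_gd - lr_gd * H * w_gd                         # plain gradient descent
    g = H * w_ada
    acc += g ** 2
    w_ada = w_ada - lr_ada * g / (np.sqrt(acc) + 1e-12)    # Adagrad-style update

print("GD loss:     ", 0.5 * np.sum(H * w_gd ** 2))
print("Adagrad loss:", 0.5 * np.sum(H * w_ada ** 2))
```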

Research Impact

Our work on the Edge of Stability in non-Euclidean settings has opened a new direction connecting optimisation geometry to generalisation, with direct implications for the design of geometry-aware training algorithms for transformers. The warm-up theory has been used to derive principled warm-up schedules for large-scale language model pre-training, replacing hand-tuned heuristics.

Selected Publications

2026
[20] Non-Euclidean Gradient Descent Operates at the Edge of Stability.

43rd International Conference on Machine Learning (Spotlight, ICML 2026)
PDF arXiv ICML

The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian converges to $2/\eta$ during training with gradient descent (GD) with step size $\eta$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness [Mishkin et al., 2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and Muon without momentum. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/\eta$. Practically, our framework provides a single, geometry-aware spectral measure that works across optimizers.
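For readers who want to reproduce the Euclidean version of this measurement, the sketch below estimates the largest Hessian eigenvalue by power iteration on Hessian-vector products (PyTorch assumed; `model`, `loss_fn`, and the data batch are hypothetical), so it can be tracked against the $2/\eta$ threshold during training.

```python
import torch

def sharpness(loss, params, iters=20):
    """Estimate the largest Hessian eigenvalue of `loss` w.r.t. `params`
    by power iteration on Hessian-vector products."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]
    hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
    return sum((h * u).sum() for h, u in zip(hv, v)).item()  # Rayleigh quotient, ||v|| = 1

# Usage (hypothetical model / loss_fn / batch):
#   params = [p for p in model.parameters() if p.requires_grad]
#   lam_max = sharpness(loss_fn(model(x), y), params)
#   print(lam_max, "vs", 2 / eta)
```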

2025
[18] Why Do We Need Warm-up? A Theoretical Perspective.

43rd International Conference on Machine Learning (ICML 2026)
PDF arXiv ICML

Learning-rate warm-up, i.e., gradually increasing the learning rate at the beginning of training, has become a ubiquitous heuristic in modern deep learning, yet its theoretical foundations remain poorly understood. In this work, we provide a principled explanation for why warm-up improves training. We rely on a generalization of the (L0, L1)-smoothness condition, which bounds local curvature as a linear function of the loss sub-optimality and exhibits desirable closure properties. We demonstrate both theoretically and empirically that this condition holds for common neural architectures trained with mean-squared error and cross-entropy losses. Under this assumption, we prove that Gradient Descent with a warm-up schedule achieves faster convergence than with a fixed step-size, establishing upper and lower complexity bounds. Finally, we validate our theoretical insights through experiments on language and vision models, confirming the practical benefits of warm-up schedules.
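For reference, the classical $(L_0, L_1)$-smoothness condition bounds curvature by the gradient norm, $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, whereas the generalisation referred to above instead ties curvature to the loss sub-optimality, $\|\nabla^2 f(x)\| \le L_0 + L_1 \,(f(x) - f^*)$; the exact form and constants used in the paper may differ.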

[16] On the Interaction of Noise, Compression Role, and Adaptivity under (L0,L1)-Smoothness: An SDE-based Approach.

43rd International Conference on Machine Learning (ICML 2026)
PDF arXiv ICML

Using stochastic differential equation (SDE) approximations, we study the dynamics of Distributed SGD, Distributed Compressed SGD, and Distributed SignSGD under (L0,L1)-smoothness and flexible noise assumptions. Our analysis provides insights, which we validate through simulation, into the intricate interactions between batch noise, stochastic gradient compression, and adaptivity in this modern theoretical setup. For instance, we show that adaptive methods such as Distributed SignSGD can successfully converge under standard assumptions on the learning rate scheduler, even under heavy-tailed noise. By contrast, Distributed (Compressed) SGD with a pre-scheduled decaying learning rate fails to achieve convergence, unless such a schedule also accounts for an inverse dependency on the gradient norm, de facto falling back into an adaptive method.
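For context, the canonical SDE approximation underlying this style of analysis models an SGD iterate with step size $\eta$ by $dX_t = -\nabla f(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t$, where $\Sigma(X_t)$ is the covariance of the stochastic gradient noise and $W_t$ a standard Wiener process; the drift and diffusion terms used for the compressed and sign-based variants in the paper may differ.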

2024
[11] Loss Landscape Characterization of Neural Networks without Over-Parametrization.

38th Annual Conference on Neural Information Processing Systems (NeurIPS 2024)
PDF arXiv NeurIPS Poster

Optimization methods play a crucial role in modern machine learning, powering the remarkable empirical achievements of deep learning models. These successes are all the more striking given the complex non-convex nature of the loss landscape of these models. Yet, ensuring the convergence of optimization methods requires specific structural conditions on the objective function that are rarely satisfied in practice. One prominent example is the widely recognized Polyak-Łojasiewicz (PL) inequality, which has gained considerable attention in recent years. However, validating such assumptions for deep neural networks entails substantial and often impractical levels of over-parametrization. To address this limitation, we propose a novel class of functions that can characterize the loss landscape of modern deep models without requiring extensive over-parametrization and can also accommodate saddle points. Crucially, we prove that gradient-based optimizers possess theoretical guarantees of convergence under this assumption. Finally, we validate the soundness of our new function class through both theoretical analysis and empirical experimentation across a diverse range of deep learning models.
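For reference, the PL inequality requires $\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$ for some $\mu > 0$ and all $x$, which forces every stationary point to be a global minimum and therefore excludes the saddle points that the proposed function class is designed to accommodate.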