
Large-Scale Optimization

Main Challenges

  • Hyperparameter Transfer: Modern large language models need optimizers that scale efficiently to billions of parameters. At the same time, extensive hyperparameter tuning becomes impractical at such scales due to limited resources. This makes it essential to develop theoretically grounded methods for transferring hyperparameters from smaller models to larger ones.
  • Adaptive Methods: Optimizers like Adam are widely used in practice, yet their theoretical properties remain only partially understood. Developing a predictive convergence theory for adaptive optimizers is therefore both important and challenging.
  • Hyperparameter Robustness: Because hyperparameter tuning is costly at large scale, modern algorithms must be robust to the choice of hyperparameters: small perturbations should not change performance significantly. Designing optimization methods that are insensitive to hyperparameter selection is therefore crucial for large-scale settings.

My Contributions

BST Scaling Rule

We study stochastic conditional gradient methods under a fixed token budget [21] and show that an optimal batch size exists. Our analysis yields predictive convergence rates, enabling joint token-aware scaling laws for batch size, sequence length, Frank-Wolfe stepsize, and momentum as the model size or token horizon grows.
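
As a rough illustration of the trade-off (not the analysis from [21]), the sketch below runs a momentum-based stochastic conditional gradient method on a toy quadratic under a fixed token budget. Everything here is an assumption made for illustration: the l-infinity ball constraint, the Gaussian gradient-noise model, the names `lmo_linf` and `run_scg`, and all default values. A larger batch reduces gradient noise but leaves fewer steps within the same budget.

```python
import random


def lmo_linf(m, radius):
    """Linear minimization oracle over the l-infinity ball:
    argmin over ||s||_inf <= radius of <m, s>, which is -radius * sign(m)."""
    return [-radius * ((v > 0) - (v < 0)) for v in m]


def run_scg(batch_size, token_budget, seq_len=1, stepsize=0.05,
            momentum=0.1, radius=1.0, dim=10, noise_std=1.0, seed=0):
    """Momentum-based stochastic conditional gradient on the toy objective
    f(x) = 0.5 * ||x - target||^2 (hypothetical setup, for illustration)."""
    rng = random.Random(seed)
    target = [0.5] * dim
    x = [0.0] * dim  # starts inside the ball
    m = [0.0] * dim  # momentum buffer
    # Fixed token budget: larger batches leave fewer optimizer steps.
    num_steps = token_budget // (batch_size * seq_len)
    for _ in range(num_steps):
        # Averaging batch_size i.i.d. noisy gradients divides the noise
        # standard deviation by sqrt(batch_size).
        g = [(xi - ti) + rng.gauss(0.0, noise_std / batch_size ** 0.5)
             for xi, ti in zip(x, target)]
        m = [(1 - momentum) * mi + momentum * gi for mi, gi in zip(m, g)]
        s = lmo_linf(m, radius)
        # Frank-Wolfe step: convex combination keeps x inside the ball.
        x = [(1 - stepsize) * xi + stepsize * si for xi, si in zip(x, s)]
    loss = 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))
    return x, loss, num_steps


for batch_size in (1, 16, 256):
    _, loss, steps = run_scg(batch_size, token_budget=4096)
    print(f"batch={batch_size:4d} steps={steps:5d} loss={loss:.4f}")
```

Sweeping `batch_size` at a fixed `token_budget` in this way is one simple means of probing where the benefit of larger batches saturates on a given problem; the toy makes no claim about where that point lies for real models.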

Robustness to Learning Rate Choice

We couple the NGN stepsize with Polyak momentum [17] and show that using this combination in base optimizers such as SGD and Adam improves robustness to the learning-rate choice without sacrificing peak performance. In some cases, raising the learning rate by up to an order of magnitude above its optimal value causes little or no performance degradation.
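
To make the robustness claim concrete, here is a minimal sketch on a one-dimensional quadratic. It uses the NGN stepsize of Orvieto and Xiao, gamma = sigma / (1 + sigma * ||g||^2 / (2 * f)), and couples it with heavy-ball (Polyak) momentum in one plausible way; the exact NGN-M update in [17] may differ, and the function names and the test problem are assumptions made for illustration.

```python
def ngn_stepsize(loss, grad_sq_norm, sigma):
    """NGN stepsize for a non-negative loss; never exceeds sigma."""
    if loss <= 0.0:
        return 0.0  # already at a zero-loss point, no step needed
    return sigma / (1.0 + sigma * grad_sq_norm / (2.0 * loss))


def ngn_momentum(loss_and_grad, x0, sigma, beta=0.9, steps=500):
    """Heavy-ball iteration with an NGN stepsize (illustrative coupling,
    not necessarily the NGN-M update from the paper)."""
    x_prev, x = x0, x0
    for _ in range(steps):
        loss, grad = loss_and_grad(x)
        gamma = ngn_stepsize(loss, grad * grad, sigma)
        x_next = x - (1.0 - beta) * gamma * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x


def gradient_descent(loss_and_grad, x0, lr, steps=50):
    """Plain gradient descent with a constant learning rate, for contrast."""
    x = x0
    for _ in range(steps):
        _, grad = loss_and_grad(x)
        x -= lr * grad
    return x


def quadratic(x, curvature=4.0):
    """Toy loss f(x) = 0.5 * curvature * x^2 and its gradient."""
    return 0.5 * curvature * x * x, curvature * x


# sigma = 10 is far above the stable learning rate for this problem
# (gradient descent diverges for lr > 2 / curvature = 0.5).
print("NGN + momentum:", ngn_momentum(quadratic, 1.0, sigma=10.0))
print("Gradient descent:", gradient_descent(quadratic, 1.0, lr=10.0))
```

On this toy problem the NGN stepsize is capped by both `sigma` and a Polyak-type bound `2 * loss / grad_sq_norm`, which is why an oversized `sigma` does not destabilize the iteration while the same value used as a constant learning rate does.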

SDE Analysis of Adam-like Optimizers

We study adaptive optimizers such as SignSGD, RMSprop(W), and Adam(W) through the lens of stochastic differential equations (SDEs) [12]. These SDEs provide quantitatively accurate models of the algorithms and reveal a nuanced relationship between adaptivity, gradient noise, and curvature. Our analysis of SignSGD highlights a precise contrast with SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tailed noise.
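
For readers unfamiliar with the SDE viewpoint, the sketch below integrates the classical first-order SDE model of SGD on a one-dimensional quadratic with the Euler-Maruyama scheme. It does not reproduce the SDEs derived in [12] for SignSGD, RMSprop(W), or Adam(W); the quadratic loss, the constant gradient-noise level, and all names and parameter values are assumptions chosen for illustration. For this model, dX = -a X dt + sqrt(lr) * noise_std dW, the process is Ornstein-Uhlenbeck with stationary variance lr * noise_std^2 / (2a).

```python
import math
import random


def euler_maruyama_sgd_sde(x0, curvature, noise_std, lr, horizon, dt, rng):
    """Integrate dX = -curvature * X dt + sqrt(lr) * noise_std dW
    from time 0 to `horizon` with Euler-Maruyama steps of size dt."""
    x = x0
    diffusion = math.sqrt(lr) * noise_std
    for _ in range(int(round(horizon / dt))):
        drift = -curvature * x
        # Brownian increment over dt has standard deviation sqrt(dt).
        x += drift * dt + diffusion * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x


def second_moment(samples):
    """Empirical mean of squares, used as a variance estimate around 0."""
    return sum(s * s for s in samples) / len(samples)


rng = random.Random(0)
lr, curvature, noise_std = 0.1, 1.0, 1.0
finals = [
    euler_maruyama_sgd_sde(1.0, curvature, noise_std, lr,
                           horizon=5.0, dt=0.01, rng=rng)
    for _ in range(2000)
]
predicted = lr * noise_std ** 2 / (2.0 * curvature)
print("empirical second moment:", second_moment(finals))
print("predicted stationary variance:", predicted)
```

Choosing `dt = lr` in this scheme recovers the SGD recursion with Gaussian gradient noise exactly, which is the sense in which the SDE is a continuous-time model of the discrete algorithm.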

Adam-SGD Gap

Prior work attributes Adam's advantage over SGD to various isolated factors. Rather than pinning the gap on a single cause [23], we show through controlled experiments that data modality, dataset, and model architecture all shape it in non-trivial ways. We further identify a consistent trend: a crossover batch size beyond which the relative advantage shifts from SGD to Adam, captured by a simple theoretical model of the gap.

Research Impact

These results improve training efficiency by reducing the need for costly hyperparameter sweeps while increasing robustness. BST scaling and the NGN stepsize reduce tuning overhead and sensitivity to the learning rate, and the SDE analysis helps explain the strong empirical performance of adaptive methods under realistic, heavy-tailed noise.

Selected Publications

2026
[23] Beyond a Single Explanation of the Adam-SGD Gap.


Prior work has identified several factors that can contribute to the performance gap between Adam and SGD, spanning data aspects, architecture design, and optimization properties. Yet these explanations are often studied in isolation, leaving their relative importance unclear. In this work, we revisit these hypotheses through a controlled empirical study across vision, language, genomics, and graph tasks, spanning modern and classical architectures, and carefully designed training setups. Our results suggest that no single factor consistently explains the Adam-SGD gap. For instance, the Adam advantage can (1) persist under a uniform vocabulary distribution yet nearly disappear under a heavy-tailed one; (2) reverse in favor of SGD in softmax-attention models; and (3) become larger under soft architectural modifications, e.g., when ReLU is replaced by a GELU nonlinearity. This suggests that the gap arises from nontrivial data and architecture interactions, rather than from a single common factor. Yet, we observe a pattern across our settings: a crossover batch size at which the relative advantage shifts from SGD to Adam as the batch size scales. These empirical results are captured by our theoretical gap model, which predicts this batch-size-dependent crossover. Our perspective helps reconcile several existing hypotheses while offering practical insights across domains.

[21] On the Role of Batch Size in Stochastic Conditional Gradient Methods.

43rd International Conference on Machine Learning (ICML 2026)

We study the role of batch size in stochastic conditional gradient methods under a μ-Kurdyka-Łojasiewicz (μ-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.

2025
[17] Enhancing Optimizer Stability: Momentum Adaptation of The NGN Step-size.

39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Modern optimization algorithms that incorporate momentum and adaptive step-sizes offer improved performance in numerous challenging deep learning tasks. However, their effectiveness is often highly sensitive to the choice of hyperparameters, especially the step-size. Tuning these parameters is often difficult, resource-intensive, and time-consuming. Therefore, recent efforts have been directed toward enhancing the stability of optimizers across a wide range of hyperparameter choices [Schaipp et al., 2024]. In this paper, we introduce an algorithm that matches the performance of state-of-the-art optimizers while improving stability to the choice of the step-size hyperparameter through a novel adaptation of the NGN step-size method [Orvieto and Xiao, 2024]. Specifically, we propose a momentum-based version (NGN-M) that attains the standard convergence rate of O(1/√K) under less restrictive assumptions than previous approaches: it requires neither an interpolation condition nor bounded stochastic gradients or iterates. Additionally, we empirically demonstrate that the combination of the NGN step-size with momentum results in enhanced robustness to the choice of the step-size hyperparameter while delivering performance that is comparable to or surpasses other state-of-the-art optimizers.

2024
[12] Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise.

13th International Conference on Learning Representations (ICLR 2025)

Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.