Large-Scale Optimization

Main Challenges

  • Hyperparameter Transfer: Modern large language models need optimizers that scale efficiently to billions of parameters. At the same time, extensive hyperparameter tuning becomes impractical at such scales due to limited resources. This makes it essential to develop theoretically grounded methods for transferring hyperparameters from smaller models to larger ones.
  • Adaptive Methods: Optimizers like Adam are widely used in practice, yet their theoretical properties remain only partially understood. Developing a predictive convergence theory for adaptive optimizers is therefore both important and challenging.
  • Hyperparameter Robustness: Because hyperparameter tuning is costly at large scale, modern algorithms must be robust to the choice of hyperparameters: small perturbations should not lead to significant changes in performance. Designing optimization methods that are insensitive to hyperparameter selection is therefore crucial in large-scale settings.

My Contributions

BST Scaling Rule

We study the convergence of stochastic conditional gradient methods under a fixed token budget. Our analysis shows that there is an optimal batch size: increasing the batch size initially improves performance, but beyond a critical threshold the benefits saturate and can eventually degrade under the fixed budget. Moreover, the resulting convergence rates are predictive, enabling us to derive joint token-aware scaling laws for batch size, sequence length, Frank–Wolfe stepsize, and momentum parameters as the model size or token horizon grows.
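As a rough illustration of the kind of update our analysis covers, the sketch below shows a momentum-based stochastic conditional gradient (Frank–Wolfe) step with the linear minimization oracle taken over an ℓ2-ball, together with the step-count bookkeeping behind the token-budget trade-off. The constraint set, the exact stepsize and momentum schedules, and the scaling rule in our work differ; the function names and constants here are illustrative only.

    import numpy as np

    def lmo_l2_ball(direction, radius=1.0):
        # Linear minimization oracle over an l2-ball:
        # argmin_{||s|| <= radius} <direction, s>.
        norm = np.linalg.norm(direction)
        return -radius * direction / norm if norm > 0 else np.zeros_like(direction)

    def momentum_cg_step(x, d_prev, stoch_grad, gamma, alpha, radius=1.0):
        # Momentum estimate of the gradient, then a conditional-gradient move.
        d = (1.0 - alpha) * d_prev + alpha * stoch_grad
        s = lmo_l2_ball(d, radius)
        return x + gamma * (s - x), d

    def num_steps(token_budget, batch_size, seq_len):
        # Under a fixed token budget, larger batches (or longer sequences) reduce
        # gradient noise per step but leave fewer steps in total; this is the
        # trade-off behind the optimal batch size in our analysis.
        return token_budget // (batch_size * seq_len)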

Robustness to Learning Rate Choice

We study the coupling of the NGN stepsize with Polyak momentum. Incorporating this combination into base optimizers such as SGD and Adam improves their robustness to the choice of learning rate. In some cases, increasing the learning rate by up to an order of magnitude beyond its optimal value results in little to no performance degradation. Importantly, the peak performance of the base optimizer is preserved.
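A minimal sketch of this coupling, assuming one common form of the NGN stepsize from Orvieto and Xiao [2024] and a standard Polyak (heavy-ball) update; the exact formula, its safeguards, and the way NGN-M couples the stepsize with momentum in our algorithm may differ, and the constants below are illustrative.

    import numpy as np

    def ngn_stepsize(loss_value, grad, c=0.5):
        # Assumed form of the NGN stepsize for a nonnegative loss f:
        #   gamma = c / (1 + c * ||g||^2 / (2 f)),
        # which shrinks the step when the gradient is large relative to the loss.
        g_sq = float(np.dot(grad, grad))
        return c / (1.0 + c * g_sq / (2.0 * max(loss_value, 1e-12)))

    def ngn_heavy_ball_step(x, x_prev, loss_value, grad, c=0.5, beta=0.9):
        # Illustrative coupling with Polyak (heavy-ball) momentum:
        #   x_{k+1} = x_k - gamma_k * g_k + beta * (x_k - x_{k-1}).
        gamma = ngn_stepsize(loss_value, grad, c)
        x_new = x - gamma * grad + beta * (x - x_prev)
        return x_new, x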

SDE Analysis of Adam-like Optimizers

We study optimization algorithms such as SGD, SignSGD, and Adam through the lens of stochastic differential equations (SDEs). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tailed noise.
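For intuition, the sketch below shows how such an SDE can be integrated numerically with an Euler–Maruyama scheme, using the classical first-order SDE model of SGD as a stand-in; the drift and diffusion terms derived for SignSGD and Adam are different, and the function names and constants here are illustrative only.

    import numpy as np

    def euler_maruyama(grad_f, sigma, x0, lr, n_steps, seed=0):
        # Euler-Maruyama integration of dX = -grad_f(X) dt + sqrt(lr) * sigma(X) dW,
        # the standard first-order SDE model of SGD with learning rate lr and
        # gradient-noise covariance factor sigma. One SGD iteration is commonly
        # identified with an SDE time increment dt = lr.
        rng = np.random.default_rng(seed)
        dt = lr
        x = np.asarray(x0, dtype=float)
        for _ in range(n_steps):
            dW = np.sqrt(dt) * rng.standard_normal(x.shape)
            x = x - grad_f(x) * dt + np.sqrt(lr) * sigma(x) @ dW
        return x

    # Example: quadratic loss f(x) = 0.5 * ||x||^2 with isotropic gradient noise.
    x_T = euler_maruyama(grad_f=lambda x: x,
                         sigma=lambda x: 0.1 * np.eye(2),
                         x0=np.ones(2), lr=0.1, n_steps=1000)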

Research Impact

BST scaling reduces resource use by removing the need for exhaustive sweeps to fit empirical scaling rules. It also improves training efficiency by adapting hyperparameters to the token horizon. Incorporating NGN stepsizes into base algorithms such as SGD and Adam reduces the need for learning-rate tuning, making them less sensitive to its choice and further saving compute. Finally, our SDE analyses show that adaptive methods are more robust to heavy-tailed gradient noise, which is commonly observed in practice, helping explain their empirical success.

Selected Publications

2026
[21] On the Role of Batch Size in Stochastic Conditional Gradient Methods.

43rd International Conference on Machine Learning (ICML 2026)

We study the role of batch size in stochastic conditional gradient methods under a μ-Kurdyka-Łojasiewicz (μ-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.

2025
[17] Enhancing Optimizer Stability: Momentum Adaptation of the NGN Step-size.

39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)

Modern optimization algorithms that incorporate momentum and adaptive step-sizes offer improved performance in numerous challenging deep learning tasks. However, their effectiveness is often highly sensitive to the choice of hyperparameters, especially the step-size. Tuning these parameters is often difficult, resource-intensive, and time-consuming. Therefore, recent efforts have been directed toward enhancing the stability of optimizers across a wide range of hyperparameter choices [Schaipp et al., 2024]. In this paper, we introduce an algorithm that matches the performance of state-of-the-art optimizers while improving stability with respect to the step-size hyperparameter through a novel adaptation of the NGN step-size method [Orvieto and Xiao, 2024]. Specifically, we propose a momentum-based version (NGN-M) that attains the standard convergence rate of O(1/√K) under less restrictive assumptions, without requiring an interpolation condition or bounded stochastic gradients or iterates, in contrast to previous approaches. Additionally, we empirically demonstrate that combining the NGN step-size with momentum results in enhanced robustness to the choice of the step-size hyperparameter while delivering performance that is comparable to or surpasses other state-of-the-art optimizers.

2024
[12] Adaptive Methods through the Lens of SDEs: Theoretical Insights on the Role of Noise.

13th International Conference on Learning Representations (ICLR 2025)

Despite the vast empirical evidence supporting the efficacy of adaptive optimization methods in deep learning, their theoretical understanding is far from complete. This work introduces novel SDEs for commonly used adaptive optimizers: SignSGD, RMSprop(W), and Adam(W). These SDEs offer a quantitatively accurate description of these optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a noteworthy and precise contrast to SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tail noise. We extend this analysis to AdamW and RMSpropW, for which we observe that the role of noise is much more complex. Crucially, we support our theoretical analysis with experimental evidence by verifying our insights: this includes numerically integrating our SDEs using Euler-Maruyama discretization on various neural network architectures such as MLPs, CNNs, ResNets, and Transformers. Our SDEs accurately track the behavior of the respective optimizers, especially when compared to previous SDEs derived for Adam and RMSprop. We believe our approach can provide valuable insights into best training practices and novel scaling rules.