We study the convergence of stochastic conditional gradient (Frank–Wolfe) methods under a fixed token budget. Our analysis shows that an optimal batch size exists for any given budget. Moreover, the resulting convergence rates are predictive: they let us derive joint token-aware scaling laws for the batch size, sequence length, Frank–Wolfe stepsize, and momentum parameters as the model size or token horizon grows.
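For concreteness, the sketch below shows one common form of a stochastic conditional gradient loop with a momentum-averaged gradient estimate over an ℓ1-ball constraint. The constraint set, the `grad_fn` interface, and all parameter names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def lmo_l1_ball(g, radius=1.0):
    """Linear minimization oracle over the l1 ball:
    argmin_{||s||_1 <= radius} <g, s> is a signed vertex."""
    s = np.zeros_like(g)
    i = np.argmax(np.abs(g))
    s[i] = -radius * np.sign(g[i])
    return s

def stochastic_frank_wolfe(x0, grad_fn, n_steps, batch_size, gamma, beta):
    """Stochastic conditional gradient with a momentum-averaged
    gradient estimate; gamma is the FW stepsize, beta the momentum."""
    x = np.array(x0, dtype=float)
    d = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad_fn(x, batch_size)         # minibatch gradient estimate
        d = beta * d + (1.0 - beta) * g    # momentum-averaged direction
        s = lmo_l1_ball(d)                 # vertex minimizing the linear model
        x = (1.0 - gamma) * x + gamma * s  # convex combination: stays feasible
    return x
```

Averaging the minibatch gradients before calling the LMO is the standard way momentum enters conditional gradient methods; it is through the variance of this averaged direction that the batch size, FW stepsize, and momentum jointly shape the convergence rate.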
We study the coupling of the NGN stepsize with Polyak momentum. Incorporating this combination into base optimizers such as SGD and Adam improves their robustness to the choice of learning rate. In some cases, increasing the learning rate by up to an order of magnitude beyond its optimal value results in little to no performance degradation. Importantly, the peak performance of the base optimizer is preserved.
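As a rough sketch, assuming the commonly cited NGN form eta = sigma / (1 + sigma * ||g||^2 / (2 * loss)) for a nonnegative loss (the exact stepsize and how the paper couples it with the momentum buffer may differ), one heavy-ball step could look like:

```python
import numpy as np

def ngn_momentum_step(x, v, loss, grad, sigma=0.5, beta=0.9, eps=1e-12):
    """One heavy-ball step with an NGN stepsize (assumed form):
    eta = sigma / (1 + sigma * ||g||^2 / (2 * loss))."""
    g2 = float(np.dot(grad, grad))
    eta = sigma / (1.0 + sigma * g2 / (2.0 * max(loss, eps)))
    v = beta * v - eta * grad   # Polyak (heavy-ball) momentum buffer
    x = x + v
    return x, v
```

The denominator automatically shrinks the step when the gradient is large relative to the loss, which is one plausible mechanism behind the learning-rate robustness described above: overshooting the optimal sigma inflates the denominator rather than the step itself.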
We study optimization algorithms such as SGD, SignSGD, and Adam through the lens of stochastic differential equations (SDEs). These SDEs offer a quantitatively accurate description of the optimizers and help illuminate an intricate relationship between adaptivity, gradient noise, and curvature. Our novel analysis of SignSGD highlights a precise contrast with SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tailed noise.
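For intuition, here is a minimal Euler–Maruyama simulation of the standard first-order SDE approximation of SGD, dX_t = -grad f(X_t) dt + sqrt(eta) * sigma dW_t, under the simplifying assumption of isotropic Gaussian gradient noise; the SDEs analyzed in the paper, especially for SignSGD and Adam, are more intricate.

```python
import numpy as np

def simulate_sgd_sde(grad, x0, eta=0.1, sigma=0.5, T=10.0, n_steps=1000, seed=0):
    """Euler-Maruyama discretization of the first-order SGD SDE
    dX_t = -grad(X_t) dt + sqrt(eta) * sigma dW_t, assuming
    isotropic gradient noise with scale sigma."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    x = np.array(x0, dtype=float)
    path = [x.copy()]
    for _ in range(n_steps):
        dW = rng.normal(scale=np.sqrt(dt), size=x.shape)
        x = x - grad(x) * dt + np.sqrt(eta) * sigma * dW
        path.append(x.copy())
    return np.array(path)
```

Swapping the drift and diffusion terms for their SignSGD counterparts would be one way to visualize the contrast in stationary distributions that the analysis makes precise.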
BST scaling reduces resource use by eliminating the exhaustive sweeps otherwise needed to fit empirical scaling rules, and it improves training efficiency by adapting hyperparameters to the token horizon. Incorporating NGN stepsizes into base algorithms such as SGD and Adam makes them less sensitive to the choice of learning rate, reducing tuning effort and further saving compute. Finally, SDE analyses show that adaptive methods are more robust to heavy-tailed gradient noise, which is commonly observed in practice, helping explain their empirical success.
43rd International Conference on Machine Learning (ICML 2026)
39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
13th International Conference on Learning Representations (ICLR 2025)