We study stochastic conditional gradient methods under a fixed token budget [21] and show that an optimal batch size exists. Our analysis yields predictive convergence rates, enabling joint token-aware scaling laws for batch size, sequence length, Frank-Wolfe stepsize, and momentum as the model size or token horizon grows.
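As a rough illustration of the kind of update studied here, the following Python sketch runs a stochastic conditional gradient (Frank-Wolfe) iteration with a momentum-averaged mini-batch gradient. The ℓ∞-ball constraint, the names `lmo_linf` and `stochastic_frank_wolfe`, the toy objective, and all hyperparameter values are assumptions made for illustration; they are not taken from [21].

```python
import numpy as np

def lmo_linf(direction, radius):
    """Linear minimization oracle over the l-inf ball of the given radius:
    returns argmin_{||s||_inf <= radius} <direction, s>."""
    return -radius * np.sign(direction)

def stochastic_frank_wolfe(grad_fn, x0, radius=1.0, stepsize=0.05,
                           momentum=0.9, batch_size=8, steps=300, seed=0):
    """Hypothetical sketch: Frank-Wolfe with a momentum-averaged stochastic gradient."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)
    for _ in range(steps):
        # Average batch_size noisy gradients (a larger batch means less noise per step).
        g = np.mean([grad_fn(x, rng) for _ in range(batch_size)], axis=0)
        # Exponential moving average of gradients (momentum).
        m = momentum * m + (1.0 - momentum) * g
        # Frank-Wolfe step: convex combination of x and the LMO output.
        s = lmo_linf(m, radius)
        x = (1.0 - stepsize) * x + stepsize * s
    return x

# Toy objective (assumption): 0.5 * ||x - target||^2 with Gaussian gradient noise.
target = np.array([0.5, -0.25, 0.0])

def noisy_grad(x, rng):
    return (x - target) + 0.1 * rng.standard_normal(x.shape)

x_final = stochastic_frank_wolfe(noisy_grad, x0=np.zeros(3))
```

Because each step is a convex combination of feasible points, the iterate stays inside the ball without any projection. In this sketch the total number of gradient samples is `steps * batch_size`, so holding that product fixed exposes the batch-size trade-off in its simplest form.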
We study the coupling of the NGN stepsize with Polyak momentum [17] and show that incorporating this combination into base optimizers such as SGD and Adam improves robustness to the choice of learning rate. Peak performance is preserved, and in some cases the learning rate can be increased by up to an order of magnitude beyond its optimal value with little or no performance degradation.
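To make the idea concrete, here is a minimal Python sketch that pairs the NGN stepsize, in its commonly stated form γ = σ / (1 + σ‖∇f‖² / (2f)) for a positive loss f, with a plain heavy-ball momentum update. The exact way [17] couples the two may differ; the function names, the (1 − β) scaling of the gradient term, and the toy quadratic are assumptions for illustration only.

```python
import numpy as np

def ngn_stepsize(loss, grad, sigma):
    """NGN stepsize for a positive loss value; always lies in (0, sigma]."""
    return sigma / (1.0 + sigma * float(grad @ grad) / (2.0 * loss))

def ngn_heavy_ball(loss_and_grad, x0, sigma=10.0, beta=0.9, steps=200):
    """Hypothetical coupling: heavy-ball momentum with an NGN-scaled gradient step."""
    x = np.asarray(x0, dtype=float).copy()
    x_prev = x.copy()
    for _ in range(steps):
        loss, grad = loss_and_grad(x)
        if loss <= 0.0:  # already at a minimizer of a non-negative loss
            break
        gamma = ngn_stepsize(loss, grad, sigma)
        x_next = x - (1.0 - beta) * gamma * grad + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Toy non-negative loss (assumption): f(x) = 0.5 * ||x||^2 with gradient x.
def quadratic(x):
    return 0.5 * float(x @ x), x

x_final = ngn_heavy_ball(quadratic, x0=np.array([3.0, -4.0]))
```

The stepsize shrinks automatically when the gradient is large relative to the loss, which is one intuition for why a large nominal `sigma` can be tolerated. On this quadratic the effective stepsize is σ / (1 + σ) regardless of the iterate, so it never exceeds 1 however large `sigma` is.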
We study optimization methods such as SGD, SignSGD, and Adam through the lens of stochastic differential equations (SDEs) [12]. These SDEs provide quantitatively accurate models of the algorithms and reveal a nuanced relationship between adaptivity, gradient noise, and curvature. Our analysis of SignSGD highlights a precise contrast with SGD in terms of convergence speed, stationary distribution, and robustness to heavy-tailed noise.
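As a small illustration of the SDE viewpoint, the sketch below compares plain SGD on a one-dimensional quadratic with an Euler-Maruyama simulation of the classic first-order SDE model of SGD, dX = −aX dt + √η σ dW, whose stationary variance is ησ² / (2a). This is the standard textbook model for SGD only; the SDEs derived in [12] for SignSGD and Adam are different and are not reproduced here. All names and parameter values are assumptions.

```python
import numpy as np

def simulate_sgd(a, sigma, eta, steps, chains, rng):
    """Run independent SGD chains on f(x) = 0.5 * a * x^2 with additive gradient noise."""
    x = np.ones(chains)
    for _ in range(steps):
        noisy_grad = a * x + sigma * rng.standard_normal(chains)
        x = x - eta * noisy_grad
    return x

def simulate_sde(a, sigma, eta, steps, chains, rng, substeps=10):
    """Euler-Maruyama for dX = -a X dt + sqrt(eta) * sigma dW over time eta * steps."""
    x = np.ones(chains)
    dt = eta / substeps  # finer grid than one SGD step
    for _ in range(steps * substeps):
        dw = np.sqrt(dt) * rng.standard_normal(chains)
        x = x - a * x * dt + np.sqrt(eta) * sigma * dw
    return x

a, sigma, eta = 1.0, 1.0, 0.01
rng = np.random.default_rng(0)
x_sgd = simulate_sgd(a, sigma, eta, steps=1000, chains=2000, rng=rng)
x_sde = simulate_sde(a, sigma, eta, steps=1000, chains=2000, rng=rng)

predicted_var = eta * sigma**2 / (2.0 * a)  # stationary variance of the SDE
print(x_sgd.var(), x_sde.var(), predicted_var)
```

For a small learning rate the two empirical variances should both sit close to the predicted value, which is the sense in which such an SDE serves as a quantitative model of the discrete algorithm.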
These results improve training efficiency by reducing the need for costly hyperparameter sweeps while increasing robustness. BST scaling and NGN stepsizes lower computational overhead and sensitivity to the learning rate, and the SDE insights explain the strong empirical performance of adaptive methods under realistic, heavy-tailed noise.
43rd International Conference on Machine Learning (ICML 2026)
39th Annual Conference on Neural Information Processing Systems (NeurIPS 2025)
13th International Conference on Learning Representations (ICLR 2025)