We analyse differentially private adaptive gradient methods via an SDE lens, proving that the privacy noise fundamentally limits the benefit of adaptivity and providing tight bounds that match empirical observations on LLM fine-tuning.
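To make concrete the kind of update the analysis covers, here is a minimal DP-Adam-style step in Python: per-sample gradients are clipped to an L2 threshold, Gaussian noise calibrated to that threshold is added, and the privatized gradient drives an Adam-style update. This is a sketch under standard DP-SGD assumptions; the function and parameter names are illustrative, not the paper's API.

```python
import numpy as np

def dp_adaptive_step(params, per_sample_grads, m, v, t,
                     clip=1.0, sigma=1.0, lr=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """One DP-Adam-style step: clip per-sample gradients, add
    Gaussian noise, then apply an Adam-style adaptive update.

    per_sample_grads: array of shape (batch, dim).
    Returns updated (params, m, v).
    """
    # Clip each per-sample gradient to L2 norm at most `clip`.
    norms = np.linalg.norm(per_sample_grads, axis=1, keepdims=True)
    scale = np.minimum(1.0, clip / (norms + 1e-12))
    clipped = per_sample_grads * scale

    # Privatize: sum the clipped gradients, add Gaussian noise
    # calibrated to the clipping threshold (the sensitivity),
    # then average over the batch.
    batch = per_sample_grads.shape[0]
    noise = np.random.normal(0.0, sigma * clip, size=params.shape)
    g = (clipped.sum(axis=0) + noise) / batch

    # Adam-style moment updates on the privatized gradient. The
    # injected noise enters both moment estimates, which is the
    # interaction an SDE-level analysis tracks.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v
```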
We show that adding momentum to clipped SGD provably accelerates convergence in the federated setting and derive the first tight convergence guarantees for clipped momentum methods under partial participation and data heterogeneity.
We reinterpret gradient clipping as a (biased) compression mechanism and prove that clipped SGD matches the convergence rate of unclipped SGD while substantially reducing the effective communication cost per round.
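To illustrate the clipping-as-compression view and the clipped-momentum setting above, below is a minimal sketch of one federated round: each participating client clips its local gradient, and the server averages the clipped messages and applies a heavy-ball momentum update. All names and the plain momentum form are illustrative assumptions, not necessarily the paper's exact CLIP21 update; the key observation is that every clipped message lies in a ball of radius tau, which is what lets clipping double as a compression operator.

```python
import numpy as np

def clipping_op(g, tau=1.0):
    """Project g onto the L2 ball of radius tau. The output always
    lies in that ball, so the message a client sends is bounded,
    like the output of a (biased) compression operator."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else g * (tau / norm)

def federated_clipped_momentum_round(params, client_grads, momentum,
                                     tau=1.0, lr=0.1, beta=0.9):
    """One synchronous round: clip each participating client's
    gradient, average the clipped messages at the server, and take
    a heavy-ball momentum step.

    client_grads: list of per-client gradient arrays (the round's
    participating subset under partial participation).
    """
    clipped = [clipping_op(g, tau) for g in client_grads]
    avg = np.mean(clipped, axis=0)
    momentum = beta * momentum + avg
    params = params - lr * momentum
    return params, momentum
```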
The DP-SDE framework provides the first principled explanation of why differentially private training with adaptive optimizers underperforms its non-private counterparts, and it guides practitioners in choosing clipping thresholds and noise levels to navigate privacy-utility trade-offs. The CLIP21 family of methods is directly applicable to federated fine-tuning of large language models, where communication and privacy constraints are simultaneously binding.
14th International Conference on Learning Representations (ICLR 2026)