Federated Learning

Main Challenges

  • Communication cost: In federated settings, clients communicate with a central server over limited-bandwidth links; raw gradient transmission is infeasible at scale.
  • Data heterogeneity: Client datasets are non-IID, causing client drift in local-update methods and degrading global convergence.
  • Privacy guarantees: Differential privacy adds noise that interacts non-trivially with gradient clipping and adaptive optimizers, weakening convergence in ways that are poorly understood.
  • Partial participation: Only a random subset of clients participates in each round; standard analyses either ignore this or rely on overly pessimistic bounds.
  • Gradient clipping for privacy: Per-sample clipping required by DP-SGD has complex interactions with momentum and adaptive step-sizes that existing theory does not capture (a minimal sketch of such a step follows this list).
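To make the last point concrete, here is a minimal sketch of one DP-SGD step with per-sample clipping and Gaussian noise. All names (dp_sgd_step, clip_norm, noise_mult) are illustrative, not taken from any specific library:

```python
import numpy as np

def dp_sgd_step(params, per_sample_grads, lr, clip_norm, noise_mult, rng):
    """One DP-SGD step: clip each per-sample gradient to norm clip_norm,
    average, then add Gaussian noise calibrated to the clipping threshold."""
    clipped = []
    for g in per_sample_grads:                     # one gradient per example
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped.append(g * scale)
    mean_grad = np.mean(clipped, axis=0)
    # Standard DP-SGD calibration: noise std = noise_mult * clip_norm / batch_size.
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return params - lr * (mean_grad + noise)

# Example: 8 per-sample gradients in R^4.
rng = np.random.default_rng(0)
grads = [rng.normal(size=4) for _ in range(8)]
params = dp_sgd_step(np.zeros(4), grads, lr=0.1, clip_norm=1.0,
                     noise_mult=1.0, rng=rng)
```

The clipping bounds each example's contribution (its sensitivity), which is what lets the Gaussian noise deliver a formal privacy guarantee; momentum and adaptive step-sizes then operate on this biased, noisy estimate, which is exactly the interaction the bullet above refers to.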

My Contributions

Differential Privacy meets Adaptive Optimizers (DP-SDE)

We analyze differentially private adaptive gradient methods through an SDE lens, characterizing how privacy noise interacts with adaptivity: sign-based adaptive updates dominate in high-privacy regimes, and their optimal hyperparameters transfer across privacy levels, with tight bounds that match empirical observations on LLM fine-tuning.
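For intuition, the following is a schematic of the kind of SDE approximation such analyses build on, sketched under assumed notation ($\gamma$: step size, $C$: clipping threshold, $\sigma$: noise multiplier, $B$: batch size); the exact drift and diffusion terms in the paper may differ:

$$
\mathrm{d}X_t = -\,\bar{g}_C(X_t)\,\mathrm{d}t + \sqrt{\gamma}\,\Big(\Sigma_C(X_t) + \frac{\sigma^2 C^2}{B^2}\, I\Big)^{1/2}\,\mathrm{d}W_t,
$$

where $\bar{g}_C$ is the expected clipped gradient and $\Sigma_C$ the covariance of the clipped stochastic gradients. The isotropic $\sigma^2 C^2/B^2$ term is the injected privacy noise; adaptivity changes how this extra diffusion enters the effective dynamics, which is what the analysis quantifies.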

Clip21 with Momentum

We show that adding momentum to clipped SGD provably accelerates convergence in the federated setting and derive the first tight convergence guarantees for clipped momentum methods under partial participation and data heterogeneity.

Clip21: Clipping as a Communication Primitive

We reinterpret gradient clipping as a compression-style operator and prove that Clip21 converges at the same $\mathcal{O}(1/K)$ rate as unclipped distributed gradient descent while dramatically reducing the effective communication cost per round.

Research Impact

The DP-SDE framework provides the first principled explanation of why differentially private training with adaptive optimizers underperforms non-private counterparts, and it guides practitioners in setting clipping thresholds and noise levels for privacy-utility trade-offs. The Clip21 family of methods is directly applicable to federated fine-tuning of large language models, where communication and privacy constraints are simultaneously binding.

Selected Publications

2026
[19] Adaptive Methods Are Preferable in High Privacy Settings: An SDE Perspective.

14th International Conference on Learning Representations (ICLR 2026)
PDF arXiv ICLR

Differential Privacy (DP) is becoming central to large-scale training as privacy regulations tighten. We revisit how DP noise interacts with adaptivity in optimization through the lens of stochastic differential equations, providing the first SDE-based analysis of private optimizers. Focusing on DP-SGD and DP-SignSGD under per-example clipping, we show a sharp contrast under fixed hyperparameters: DP-SGD converges with a privacy-utility trade-off of $\mathcal{O}(1/\varepsilon^2)$ at a speed independent of $\varepsilon$, while DP-SignSGD converges at a speed linear in $\varepsilon$ with an $\mathcal{O}(1/\varepsilon)$ trade-off, dominating in high-privacy or large batch-noise regimes. By contrast, under optimal learning rates, both methods achieve comparable theoretical asymptotic performance; however, the optimal learning rate of DP-SGD scales linearly with $\varepsilon$, while that of DP-SignSGD is essentially $\varepsilon$-independent. This makes adaptive methods far more practical, as their hyperparameters transfer across privacy levels with little or no re-tuning. Empirical results confirm our theory across training and test metrics and extend from DP-SignSGD to DP-Adam.
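For readers who want the mechanics, here is a minimal sketch of a DP-SignSGD step as studied above, assuming the same per-example clipping and Gaussian noise as DP-SGD (function and parameter names are illustrative, not the paper's):

```python
import numpy as np

def dp_signsgd_step(params, per_sample_grads, lr, clip_norm, noise_mult, rng):
    """One DP-SignSGD step: privatize the averaged clipped gradient exactly
    as in DP-SGD, then descend along its elementwise sign."""
    clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
               for g in per_sample_grads]
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return params - lr * np.sign(mean_grad + noise)
```

Because the update magnitude is fixed by the sign, the step size need not shrink with the privacy noise scale, which is one intuition for the $\varepsilon$-independent optimal learning rate reported above.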

2025
[13] Double Momentum and Error Feedback for Clipping with Fast Rates and Differential Privacy.

PDF arXiv

Strong differential privacy (DP) and optimization guarantees are two desirable properties for a method in Federated Learning (FL). However, existing algorithms do not achieve both at once: they either have optimal DP guarantees but rely on restrictive assumptions such as bounded gradients or bounded data heterogeneity, or they ensure strong optimization performance but lack DP guarantees. To address this gap in the literature, we propose and analyze a new method called Clip21-SGD2M, based on a novel combination of clipping, heavy-ball momentum, and error feedback. In particular, for non-convex smooth distributed problems with clients having arbitrarily heterogeneous data, we prove that Clip21-SGD2M has an optimal convergence rate and a near-optimal (local-)DP neighborhood. Our numerical experiments on non-convex logistic regression and neural network training highlight the superiority of Clip21-SGD2M over baselines in terms of optimization performance for a given DP budget.
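To make the three ingredients concrete, below is an illustrative composition of clipping, heavy-ball momentum, and error feedback in a distributed loop. This is a sketch under my own naming (sgd2m_round, shifts, g_hat are not from the paper) and is not the exact Clip21-SGD2M recursion:

```python
import numpy as np

def clip(v, tau):
    """Scale v down to norm tau if it is longer; otherwise leave it."""
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def sgd2m_round(x, grads, m_cl, shifts, g_hat, m_srv, lr, beta, tau):
    """Illustrative round combining the three components (NOT the paper's
    exact update). m_cl: client momentum buffers; shifts: error-feedback
    states; g_hat: server estimate of the average momentum; m_srv: a
    second, server-side momentum buffer."""
    n = len(grads)
    for i in range(n):
        m_cl[i] = beta * m_cl[i] + grads[i]          # client heavy-ball momentum
    deltas = [clip(m_cl[i] - shifts[i], tau) for i in range(n)]
    for i in range(n):
        shifts[i] = shifts[i] + deltas[i]            # error-feedback state update
    g_hat = g_hat + np.mean(deltas, axis=0)          # server aggregates corrections
    m_srv = beta * m_srv + g_hat                     # second (server-side) momentum
    return x - lr * m_srv, m_cl, shifts, g_hat, m_srv
```

The error-feedback states ensure the clipping bias is corrected over rounds rather than accumulating, which is what removes the need for bounded-gradient or bounded-heterogeneity assumptions in the analysis.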

2023
[7] Clip21: Error Feedback for Gradient Clipping.

PDF arXiv

Motivated by the increasing popularity and importance of large-scale training under differential privacy (DP) constraints, we study distributed gradient methods with gradient clipping, i.e., clipping applied to the gradients computed from local information at the nodes. While gradient clipping is an essential tool for injecting formal DP guarantees into gradient-based methods [Abadi et al., 2016], it also induces bias which causes serious convergence issues specific to the distributed setting. Inspired by recent progress in the error-feedback literature, which focuses on taming the bias/error introduced by communication compression operators such as Top-$k$ [Richtárik et al., 2021], and by mathematical similarities between the clipping operator and contractive compression operators, we design Clip21 -- the first provably effective and practically useful error-feedback mechanism for distributed methods with gradient clipping. We prove that our method converges at the same $\mathcal{O}(1/K)$ rate as distributed gradient descent in the smooth nonconvex regime, improving on the previous best $\mathcal{O}(1/\sqrt{K})$ rate, which was obtained under significantly stronger assumptions. In practice, our method converges significantly faster than competing methods.
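Because Clip21 follows the EF21-style error-feedback template [Richtárik et al., 2021] with clipping playing the role of the contractive compressor, a minimal sketch of that template may help; the state names (shifts, g_server) are my own, and details may differ from the paper:

```python
import numpy as np

def clip(v, tau):
    """Clipping operator: scale v down to norm tau if it is longer."""
    n = np.linalg.norm(v)
    return v if n <= tau else v * (tau / n)

def clip21_round(x, grads, shifts, g_server, lr, tau):
    """One round of error feedback with clipping as the 'compressor'.
    Each node i keeps a shift g_i, sends only the clipped correction
    clip(grad_i - g_i, tau), and both sides add it to their running
    estimates, so the clipping bias is corrected over rounds."""
    deltas = [clip(grads[i] - shifts[i], tau) for i in range(len(grads))]
    for i, d in enumerate(deltas):
        shifts[i] = shifts[i] + d                    # node-side state update
    g_server = g_server + np.mean(deltas, axis=0)    # server-side aggregate
    return x - lr * g_server, shifts, g_server
```

Once the shifts catch up to the true local gradients, the clipped corrections become exact, which is the mechanism behind the $\mathcal{O}(1/K)$ rate matching unclipped distributed gradient descent.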