AAAI 2026 Conference Paper
FedAdamW: A Communication-Efficient Optimizer with Convergence and Generalization Guarantees for Federated Large Models
- Junkang Liu
- Fanhua Shang
- Hongying Liu
- Yuxuan Tian
- Yuanyuan Liu
- Jin Liu
- Kewen Zhu
- Zhouchen Lin
AdamW has become one of the most effective optimizers for training large-scale models, and we have also observed its effectiveness in federated learning (FL). However, directly applying AdamW in federated settings poses significant challenges: (1) due to data heterogeneity, AdamW often yields high variance in the second-moment estimate $v$; (2) local overfitting of AdamW may cause client drift; and (3) reinitializing the moment estimates ($v$, $m$) at each round slows down convergence. To address these challenges, we propose the first federated AdamW algorithm, called FedAdamW, for training and fine-tuning various large models. FedAdamW aligns local updates with the global update using both a local correction mechanism and decoupled weight decay to mitigate local overfitting. FedAdamW efficiently aggregates the mean of the second-moment estimates to reduce their variance and uses it to reinitialize them. Theoretically, we prove that FedAdamW achieves a linear speedup convergence rate of $\mathcal{O}\big(\sqrt{L\Delta\sigma_l^2/(SKR\epsilon^2)} + L\Delta/R\big)$ without a heterogeneity assumption, where $S$ is the number of participating clients per round, $K$ is the number of local iterations, and $R$ is the total number of communication rounds. We also employ a PAC-Bayesian generalization analysis to explain the effectiveness of decoupled weight decay in local training. Empirically, we validate the effectiveness of FedAdamW on language and vision Transformer models. Compared with several baselines, FedAdamW significantly reduces the number of communication rounds and improves test accuracy.
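To make the abstract's description concrete, the sketch below illustrates, in NumPy, one communication round combining the three ingredients named above: decoupled weight decay applied outside the adaptive update, a drift-correction term added to local gradients to align local and global updates, and server-side averaging of the second-moment estimates used to reinitialize clients in the next round. This is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the zero-initialized correction placeholders, and the exact form of the correction term are illustrative choices, since the abstract does not specify them.

```python
import numpy as np

def local_adamw_steps(w_global, grads, v_init, correction, lr=1e-3,
                      betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
    """Run K local AdamW-style steps on one client.

    correction: assumed drift-correction vector added to each local gradient
    to keep the local update aligned with the global update (the paper's
    actual correction rule may differ).
    """
    w = w_global.copy()
    m = np.zeros_like(w)          # first moment reset locally each round
    v = v_init.copy()             # second moment reinitialized from the server
    b1, b2 = betas
    for t, g in enumerate(grads, start=1):
        g = g + correction                          # align local with global update
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
        w = w - lr * weight_decay * w               # decoupled weight decay
    return w, v

def server_round(w_global, v_global, client_grads, corrections):
    """One communication round: aggregate weights and the mean second moment."""
    new_ws, new_vs = [], []
    for grads, corr in zip(client_grads, corrections):
        w_i, v_i = local_adamw_steps(w_global, grads, v_global, corr)
        new_ws.append(w_i)
        new_vs.append(v_i)
    # Average the client models and their second-moment estimates; the averaged
    # v is broadcast back and used to reinitialize clients in the next round.
    return np.mean(new_ws, axis=0), np.mean(new_vs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, S, K = 4, 3, 5                                # dimension, clients, local steps
    w, v = rng.normal(size=d), np.zeros(d)
    grads = [[rng.normal(size=d) for _ in range(K)] for _ in range(S)]
    corrections = [np.zeros(d) for _ in range(S)]    # placeholder correction terms
    w, v = server_round(w, v, grads, corrections)
    print("updated global weights:", w)
```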