Gradient Multi-Normalization for Efficient LLM Training

Meyer Scetbon; Chao Ma; Wenbo Gong; Ted Meeds

Back to NeurIPS

NeurIPS 2025

Gradient Multi-Normalization for Efficient LLM Training

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

PDF Details

Abstract

Training large language models (LLMs) commonly relies on adaptive optimizers such as Adam (Kingma & Ba 2015), which accelerate convergence through moment estimates but incur substantial memory overhead. Recent stateless approaches such as SWAN (Ma et al. , 2024) have shown that appropriate preprocessing of instantaneous gradient matrices can match the performance of adaptive methods without storing optimizer states. Building on this insight, we introduce \emph{gradient multi-normalization}, a principled framework for designing stateless optimizers that normalize gradients with respect to multiple norms simultaneously. Whereas standard first-order methods can be viewed as gradient normalization under a single norm (Bernstein & Newhouse, 2024), our formulation generalizes this perspective to a multi-norm setting. We derive an efficient alternating scheme that enforces these normalization constraints and show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem. This unifies and extends prior stateless optimizers, showing that SWAN arises as a specific instance with particular norm choices. Leveraging this principle, we develop SinkGD, a lightweight matrix optimizer that retains the memory footprint of SGD while substantially reducing computation relative to whitening-based methods. On the memory-efficient LLaMA training benchmark (Zhao et al. , 2024), SinkGD achieves state-of-the-art performance, reaching the same evaluation perplexity as Adam using only 40\% of the training tokens.

Gradient Multi-Normalization for Efficient LLM Training

Abstract

Authors

Keywords

Context