NeurIPS 2025
Gradient Multi-Normalization for Efficient LLM Training
Abstract
Training large language models (LLMs) commonly relies on adaptive optimizers such as Adam (Kingma & Ba 2015), which accelerate convergence through moment estimates but incur substantial memory overhead. Recent stateless approaches such as SWAN (Ma et al. , 2024) have shown that appropriate preprocessing of instantaneous gradient matrices can match the performance of adaptive methods without storing optimizer states. Building on this insight, we introduce \emph{gradient multi-normalization}, a principled framework for designing stateless optimizers that normalize gradients with respect to multiple norms simultaneously. Whereas standard first-order methods can be viewed as gradient normalization under a single norm (Bernstein & Newhouse, 2024), our formulation generalizes this perspective to a multi-norm setting. We derive an efficient alternating scheme that enforces these normalization constraints and show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem. This unifies and extends prior stateless optimizers, showing that SWAN arises as a specific instance with particular norm choices. Leveraging this principle, we develop SinkGD, a lightweight matrix optimizer that retains the memory footprint of SGD while substantially reducing computation relative to whitening-based methods. On the memory-efficient LLaMA training benchmark (Zhao et al. , 2024), SinkGD achieves state-of-the-art performance, reaching the same evaluation perplexity as Adam using only 40\% of the training tokens.
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue
- Annual Conference on Neural Information Processing Systems
- Archive span
- 1987-2025
- Indexed papers
- 30776
- Paper id
- 785164929578804044