Ted Meeds Papers

NeurIPS Conference 2025 Conference Paper

Gradient Multi-Normalization for Efficient LLM Training

Meyer Scetbon
Chao Ma
Wenbo Gong
Ted Meeds

Training large language models (LLMs) commonly relies on adaptive optimizers such as Adam (Kingma & Ba 2015), which accelerate convergence through moment estimates but incur substantial memory overhead. Recent stateless approaches such as SWAN (Ma et al. , 2024) have shown that appropriate preprocessing of instantaneous gradient matrices can match the performance of adaptive methods without storing optimizer states. Building on this insight, we introduce \emph{gradient multi-normalization}, a principled framework for designing stateless optimizers that normalize gradients with respect to multiple norms simultaneously. Whereas standard first-order methods can be viewed as gradient normalization under a single norm (Bernstein & Newhouse, 2024), our formulation generalizes this perspective to a multi-norm setting. We derive an efficient alternating scheme that enforces these normalization constraints and show that our procedure can produce, up to an arbitrary precision, a fixed-point of the problem. This unifies and extends prior stateless optimizers, showing that SWAN arises as a specific instance with particular norm choices. Leveraging this principle, we develop SinkGD, a lightweight matrix optimizer that retains the memory footprint of SGD while substantially reducing computation relative to whitening-based methods. On the memory-efficient LLaMA training benchmark (Zhao et al. , 2024), SinkGD achieves state-of-the-art performance, reaching the same evaluation perplexity as Adam using only 40\% of the training tokens.

PDF Details

NeurIPS Conference 2015 Conference Paper

Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference

Ted Meeds
Max Welling

We describe an embarrassingly parallel, anytime Monte Carlo method for likelihood-free models. The algorithm starts with the view that the stochasticity of the pseudo-samples generated by the simulator can be controlled externally by a vector of random numbers u, in such a way that the outcome, knowing u, is deterministic. For each instantiation of u we run an optimization procedure to minimize the distance between summary statistics of the simulator and the data. After reweighing these samples using the prior and the Jacobian (accounting for the change of volume in transforming from the space of summary statistics to the space of parameters) we show that this weighted ensemble represents a Monte Carlo estimate of the posterior distribution. The procedure can be run embarrassingly parallel (each node handling one sample) and anytime (by allocating resources to the worst performing sample). The procedure is validated on six experiments.

PDF Details

Possible papers

Gradient Multi-Normalization for Efficient LLM Training

Optimization Monte Carlo: Efficient and Embarrassingly Parallel Likelihood-Free Inference