
ICML 2025

Nesterov Method for Asynchronous Pipeline Parallel Optimization

Conference Paper · Accept (poster) · Artificial Intelligence · Machine Learning

Abstract

Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. Asynchronous optimization is appealing in this setting because, by construction, it achieves 100% pipeline utilization. However, it is inherently challenging: weights and gradients are no longer synchronized, which produces stale (delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to counteract the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate under a fixed gradient delay. Experiments on large-scale language modelling tasks with decoder-only architectures of up to 1B parameters demonstrate that our approach significantly outperforms existing asynchronous methods and even surpasses the synchronous baseline.
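
For intuition, the following is a minimal sketch of a delay-aware NAG step in Python, assuming a fixed gradient delay tau and a geometric sum of momentum terms for the extended look-ahead horizon. The function delayed_nag_step, its hyperparameters, and the look-ahead coefficient are illustrative assumptions, not the paper's exact update rule.

    import numpy as np

    def delayed_nag_step(theta, v, grad_fn, lr=0.01, mu=0.9, tau=3):
        # Classical NAG evaluates the gradient at a one-step look-ahead,
        # theta + mu * v. Under a fixed delay tau, the received gradient
        # reflects weights from tau steps ago, so this sketch stretches the
        # look-ahead to cover those steps (assumption: geometric sum of
        # momentum contributions, not the paper's exact rule).
        lookahead = mu * (1.0 - mu ** (tau + 1)) / (1.0 - mu)
        g = grad_fn(theta + lookahead * v)  # gradient at the extended look-ahead
        v = mu * v - lr * g                 # momentum accumulation
        return theta + v, v                 # weight update

    # Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    theta, v = np.ones(4), np.zeros(4)
    for _ in range(100):
        theta, v = delayed_nag_step(theta, v, grad_fn=lambda w: w)
    print(theta)  # approaches the optimum at zero

In this sketch the delay compensation is folded entirely into the look-ahead point, so the momentum accumulation and weight update themselves stay identical to standard NAG.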

Authors

Keywords

  • Asynchronous Optimization
  • Pipeline Parallelism
  • Nesterov Method
  • Convergence Analysis
  • Decentralized Training
  • Protocol Learning

Context

Venue: International Conference on Machine Learning
Archive span: 1993-2025
Indexed papers: 16471
Paper id: 539763185937709372