
NeurIPS 2025

MeCeFO: Enhancing LLM Training Robustness via Fault-Tolerant Optimization

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

As distributed optimization scales to meet the demands of Large Language Model (LLM) training, hardware failures become increasingly non-negligible. Existing fault-tolerant training methods often introduce significant computational or memory overhead, demanding additional resources. To address this challenge, we propose **Me**mory- and **C**omputation-**e**fficient **F**ault-tolerant **O**ptimization (**MeCeFO**), a novel algorithm that ensures robust training with minimal overhead. When a computing node fails, MeCeFO seamlessly transfers its training task to a neighboring node while employing memory- and computation-efficient algorithmic optimizations to minimize the extra workload imposed on the neighboring node handling both tasks. MeCeFO leverages three key algorithmic designs: (i) Skip-connection, which drops the multi-head attention (MHA) module during backpropagation for memory- and computation-efficient approximation; (ii) Recomputation, which reduces activation memory in feedforward networks (FFNs); and (iii) Low-rank gradient approximation, enabling efficient estimation of FFN weight matrix gradients. Theoretically, MeCeFO matches the convergence rate of conventional distributed training, with a rate of $\mathcal{O}(1/\sqrt{nT})$, where $n$ is the data parallelism size and $T$ is the number of iterations. Empirically, MeCeFO maintains robust performance under high failure rates, incurring only a 4.18\% drop in throughput, demonstrating $5.0\times$ to $6.7\times$ greater resilience than previous SOTA approaches.
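The low-rank gradient approximation in (iii) can be illustrated with a minimal numpy sketch. For a linear layer, the weight gradient is the sum of outer products $G = X^\top G_Y$ over the batch; sketching both factors through a shared low-rank projection yields a rank-$r$ estimate at reduced memory cost. The projection scheme and all names below (`low_rank_grad`, the random orthonormal `P`) are illustrative assumptions for exposition, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def low_rank_grad(x, gy, r):
    """Rank-r sketch of the FFN weight gradient G = x.T @ gy.

    x:  (batch, d_in)  layer-input activations
    gy: (batch, d_out) gradients w.r.t. the layer output
    Rather than keeping the full activations, store only their projections
    onto r random orthonormal directions over the batch dimension
    (hypothetical scheme; MeCeFO's construction may differ)."""
    # Random orthonormal basis over the batch dimension: (batch, r)
    P = np.linalg.qr(rng.standard_normal((x.shape[0], r)))[0]
    # Only these r-dimensional sketches need to be stored
    xs, gys = P.T @ x, P.T @ gy          # (r, d_in), (r, d_out)
    return xs.T @ gys                    # rank <= r estimate of x.T @ gy
```

With `r` equal to the batch size the projection is a full orthogonal basis and the estimate recovers the exact gradient; shrinking `r` trades accuracy for memory, which is the regime a node covering two training tasks would use.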

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
Annual Conference on Neural Information Processing Systems
Archive span
1987-2025
Indexed papers
30776
Paper id
1036590594016720396