Mini-Sequence Transformers: Optimizing Intermediate Memory for Long Sequences Training

Cheng Luo; Jiawei Zhao; Zhuoming Chen; Beidi Chen; Anima Anandkumar

doi:10.52202/079017-3086

NeurIPS 2024

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

We introduce Mini-Sequence Transformer (MsT), a simple and effective methodology for highly efficient and accurate LLM training with extremely long sequences. MsT partitions input sequences and iteratively processes mini-sequences to reduce intermediate memory usage. Integrated with activation recomputation, it enables significant memory savings in both the forward and backward passes. In experiments with the Llama3-8B model, we measure no degradation in throughput or convergence with MsT, even with sequences 12x longer than those of standard implementations. MsT is fully general, implementation-agnostic, and requires minimal code changes to integrate with existing LLM training frameworks. Integrated with the Hugging Face library, MsT extends the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.
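The partitioning idea can be illustrated with a minimal sketch. The abstract does not say which layers MsT partitions, so this example assumes a position-wise feed-forward block, where each token is processed independently and chunking along the sequence axis leaves the output unchanged. The names `mlp_block`, `mini_sequence_mlp`, and the `stats` counter are hypothetical and are not taken from the authors' code; activation recomputation is not shown.

```python
import random

def mlp_block(rows, w_up, w_down, stats):
    """Position-wise MLP over a list of token vectors (hypothetical helper)."""
    # Materialize the full intermediate (len(rows) x d_ff), as a batched matmul would.
    hidden = [[max(0.0, sum(a * b for a, b in zip(x, col))) for col in w_up]
              for x in rows]
    # Record the largest intermediate seen so far, in number of elements.
    stats["peak"] = max(stats["peak"], len(hidden) * len(w_up))
    # Project back down to the model dimension.
    return [[sum(h * w for h, w in zip(hrow, col)) for col in w_down]
            for hrow in hidden]

def mini_sequence_mlp(rows, w_up, w_down, num_chunks, stats):
    """Apply the same block one mini-sequence at a time (illustrative only)."""
    chunk = max(1, -(-len(rows) // num_chunks))  # ceiling division
    out = []
    for start in range(0, len(rows), chunk):
        # Only one chunk's intermediate is alive at any time.
        out.extend(mlp_block(rows[start:start + chunk], w_up, w_down, stats))
    return out

# Toy sizes chosen for illustration, not taken from the paper.
random.seed(0)
seq_len, d_model, d_ff, num_chunks = 16, 4, 32, 4
x = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(seq_len)]
w_up = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
w_down = [[random.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]

full_stats = {"peak": 0}
mini_stats = {"peak": 0}
full_out = mlp_block(x, w_up, w_down, full_stats)
mini_out = mini_sequence_mlp(x, w_up, w_down, num_chunks, mini_stats)

print("outputs identical:", full_out == mini_out)
print("peak intermediate elements:", full_stats["peak"], "vs", mini_stats["peak"])
```

In this toy setting the peak intermediate shrinks by the number of chunks while the output is unchanged. The savings reported in the abstract also depend on activation recomputation, which this sketch omits.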

Context

Venue: Annual Conference on Neural Information Processing Systems
Paper id: 198506185055263547