Accelerating Blockwise Parallel Language Models with Draft Refinement

Taehyeon Kim; Ananda T. Suresh; Kishore Papineni; Michael Riley; Sanjiv Kumar; Adrian Benton

doi:10.52202/079017-1081


NeurIPS 2024


Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning


Abstract

Autoregressive language models have achieved remarkable advancements, yet their potential is often limited by the slow inference speeds associated with sequential token generation. Blockwise parallel decoding (BPD) was proposed by Stern et al. [42] as a method to improve the inference speed of language models by simultaneously predicting multiple future tokens, termed block drafts, which are subsequently verified by the autoregressive model. This paper advances the understanding and improvement of block drafts in two ways. First, we analyze the token distributions generated across multiple prediction heads. Second, leveraging these insights, we propose algorithms that improve BPD inference speed by refining the block drafts using task-independent n-gram and neural language models as lightweight rescorers. Experiments demonstrate that refining the block drafts of the open-source Vicuna and Medusa LLMs increases the mean accepted token length by 5-25% relative. This results in over a 3x speedup in wall clock time compared to standard autoregressive decoding in open-source 7B and 13B LLMs.
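
The draft-then-verify step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`accepted_length`, `toy_next_token`) and the toy model are hypothetical, the base model is assumed to decode greedily, and verification is written as a sequential loop for readability, whereas BPD checks all draft positions in parallel. The sketch only shows how the accepted token length of one block draft could be counted.

```python
from typing import Callable, List, Sequence


def accepted_length(
    prefix: Sequence[int],
    draft: Sequence[int],
    next_token: Callable[[List[int]], int],
) -> int:
    """Count the leading draft tokens that match the base model's greedy choice.

    `next_token` is a hypothetical stand-in for the autoregressive model:
    it maps a context (list of token ids) to the greedy next token id.
    """
    context = list(prefix)
    accepted = 0
    for token in draft:
        # Stop at the first draft token the base model would not have produced.
        if next_token(context) != token:
            break
        context.append(token)
        accepted += 1
    return accepted


def toy_next_token(context: List[int]) -> int:
    # Toy "model" for illustration only: always predicts last token id + 1.
    return context[-1] + 1


if __name__ == "__main__":
    # The draft agrees with the toy model for two tokens, then diverges.
    print(accepted_length([1, 2, 3], [4, 5, 7, 8], toy_next_token))
```

Under these assumptions, a refined draft that agrees with the base model for more leading positions yields a larger accepted length, which is the quantity the abstract reports as increasing by 5-25% relative.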
