
AAAI 2026

DIAA: A Decoding-Efficient Inference Acceleration Approach for On-Device Large Language Models

Conference Paper | AAAI Technical Track on Machine Learning VIII | Artificial Intelligence

Abstract

Large Language Models (LLMs) have revolutionized intelligent interaction, enabling mobile applications such as personal assistants to run locally on edge devices. Speculative decoding (SD) has emerged as a promising paradigm for accelerating LLM inference without compromising generation quality, operating in a draft-then-verify manner. However, due to the constrained computing and memory resources of edge devices, existing SD works rely heavily on an auxiliary draft model, which incurs an additional memory burden and hinders adaptability, and on static token trees, which yield suboptimal inference performance. To this end, we propose DIAA, a Decoding-efficient Inference Acceleration Approach for on-device LLMs. DIAA achieves plug-and-play, model-agnostic inference speedup with memory and computational efficiency on edge devices. Specifically, a pair of lightweight look-up tables (LUTs) is constructed via Top-K token sampling to cache historical tokens and their probabilities for rapid candidate drafting. DIAA integrates a dynamic token tree, built from the LUT priors and updated during decoding to adapt to the online context, which enables parallelized verification. A computation-overlap scheme then pipelines the update operations of the token tree, LUTs, and KV cache to improve computational efficiency. Finally, in extensive experiments on the NVIDIA Jetson edge platform, DIAA outperforms existing baselines in generation speed and inference wall-clock time while incurring minimal memory overhead.
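The LUT-based draft-then-verify loop described in the abstract can be made concrete with a short sketch. The snippet below is an illustrative toy, not the paper's implementation: the LUT layout (a map from the previous token to its cached Top-K successors), the draft length, and the stand-in target model are all assumptions introduced here; DIAA's actual dynamic token tree and batched tree verification are more sophisticated.

```python
# A minimal sketch (not the authors' code) of LUT-based draft-then-verify
# speculative decoding as described in the abstract. The LUT structure,
# Top-K size, and the toy "target model" below are illustrative assumptions.
from collections import defaultdict

K = 3          # Top-K candidates cached per context token (assumed)
DRAFT_LEN = 4  # number of tokens drafted per step (assumed)

# LUT: previous token -> cached (candidate, probability) list, populated
# from historical decoding steps via Top-K sampling.
token_lut = defaultdict(list)

def update_lut(prev_tok, topk):
    """Cache the Top-K (token, prob) pairs observed after prev_tok."""
    token_lut[prev_tok] = sorted(topk, key=lambda tp: -tp[1])[:K]

def draft(prev_tok, length=DRAFT_LEN):
    """Greedily chain the most probable cached successor to build a draft."""
    seq = []
    for _ in range(length):
        cands = token_lut.get(prev_tok)
        if not cands:
            break
        prev_tok = cands[0][0]
        seq.append(prev_tok)
    return seq

def verify(prefix, draft_seq, target_next):
    """Accept the longest draft prefix matching the target model's output.
    target_next(prefix) stands in for one (batched) LLM forward pass."""
    accepted = []
    for tok in draft_seq:
        if target_next(prefix + accepted) != tok:
            break
        accepted.append(tok)
    # The verifying forward pass also yields one "free" next token.
    accepted.append(target_next(prefix + accepted))
    return accepted

# Toy target model: a deterministic successor rule over a tiny vocabulary.
rule = {1: 2, 2: 3, 3: 4, 4: 5, 5: 1}
target_next = lambda ctx: rule[ctx[-1]]

# Warm the LUT with "historical" Top-K observations, then decode one step.
for t, nxt in rule.items():
    update_lut(t, [(nxt, 0.9), ((nxt % 5) + 1, 0.1)])

prefix = [1]
print(draft(1))                               # drafted candidates: [2, 3, 4, 5]
print(verify(prefix, draft(1), target_next))  # accepted tokens: [2, 3, 4, 5, 1]
```

In practice the verification step is a single batched forward pass of the target model over the drafted candidates, which is what makes the paradigm profitable: every accepted draft token saves one sequential decoding step without changing the output distribution.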

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue: AAAI Conference on Artificial Intelligence
Archive span: 1980-2026
Indexed papers: 28,718
Paper ID: 1119518409328217625