Arrow Research search

Author name cluster

Sheng Lu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
1 author row

Possible papers

2

AAAI Conference 2026 · Conference Paper

DIAA: A Decoding-Efficient Inference Acceleration Approach for On-Device Large Language Models

  • Hao Tian
  • Sheng Lu
  • Fuwen Tian
  • Guangming Cui
  • Zheng Li
  • Xuyun Zhang
  • Quan Z. Sheng
  • Wanchun Dou

Large Language Models (LLMs) have revolutionized intelligent interaction, enabling mobile applications such as personal assistants to run locally on edge devices. Speculative decoding (SD) has emerged as a promising paradigm for accelerating LLM inference without compromising generation quality, following a draft-then-verify scheme. However, owing to the constrained computing and memory resources of edge devices, existing SD approaches rely heavily on an auxiliary draft model, which incurs additional memory overhead and hinders adaptability, and on static token trees, which yield suboptimal inference performance. To this end, we propose DIAA, a Decoding-efficient Inference Acceleration Approach for on-device LLMs. DIAA achieves plug-and-play, model-agnostic inference speedup with memory and computation efficiency on edge devices. Specifically, a pair of lightweight look-up tables (LUTs) is constructed via Top-K token sampling to cache historical tokens and their probabilities for rapid candidate drafting. DIAA integrates a dynamic token tree with these LUTs to enable parallel verification; the tree is updated during decoding to adapt to the online context. A computation-overlap scheme then pipelines the update operations of the token tree, the LUTs, and the KV cache to improve computational efficiency. Finally, in extensive experiments on the NVIDIA Jetson edge platform, DIAA outperforms existing baselines in generation speed and inference wall-clock time while incurring minimal memory overhead.
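The core idea the abstract describes, drafting candidate tokens from a cached look-up table and then verifying them against the target model, can be sketched in a few lines. This is a minimal illustration under assumed simplifications, not DIAA's actual implementation: the LUT here maps a single previous token to its Top-K observed successors, the "token tree" degenerates to a single greedy chain, and `target_next` stands in for a real target-model decoding step. All function names are hypothetical.

```python
# Minimal sketch of LUT-based draft-then-verify speculative decoding.
# Assumption: a toy LUT keyed on one token of context; real systems use
# richer contexts, probability-weighted trees, and batched verification.

def build_lut(history, k=2):
    """Cache, for each token, the Top-K tokens observed to follow it,
    along with their empirical probabilities."""
    counts = {}
    for prev, nxt in zip(history, history[1:]):
        counts.setdefault(prev, {})
        counts[prev][nxt] = counts[prev].get(nxt, 0) + 1
    lut = {}
    for prev, nxts in counts.items():
        top = sorted(nxts.items(), key=lambda kv: -kv[1])[:k]
        total = sum(c for _, c in top)
        lut[prev] = [(tok, c / total) for tok, c in top]
    return lut

def draft(lut, last_token, depth=3):
    """Greedily draft a candidate chain from the LUT (a linear 'token tree')."""
    chain, tok = [], last_token
    while len(chain) < depth and tok in lut:
        tok = lut[tok][0][0]  # most probable cached successor
        chain.append(tok)
    return chain

def speculative_step(lut, target_next, last_token, depth=3):
    """Draft from the LUT, then accept the longest prefix the target model
    agrees with; on mismatch (or an empty draft), fall back to one
    target-model token so decoding always progresses."""
    chain = draft(lut, last_token, depth)
    accepted, ctx = [], last_token
    for tok in chain:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx = tok
        else:
            break
    if len(accepted) < depth:
        accepted.append(target_next(ctx))
    return accepted
```

When the cached statistics match the target model's behavior, one verification step accepts several drafted tokens at once, which is the source of the speedup; when they disagree, the step degrades gracefully to ordinary one-token decoding.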