Sim-LLM: Optimizing LLM Inference at the Edge through Inter-Task KV Reuse

Ruikun Luo; Changwei Gu; Qiang He; Feifei Chen; Song Wu; Hai Jin; Yun Yang

NeurIPS 2025

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

Abstract

KV cache technology, by storing key-value pairs, helps reduce the computational overhead incurred by large language models (LLMs). It facilitates their deployment on resource-constrained edge computing nodes like edge servers. However, as the complexity and size of tasks increase, KV cache usage leads to substantial GPU memory consumption. Existing research has focused on mitigating KV cache memory usage through sequence length reduction, task-specific compression, and dynamic eviction policies. However, these methods are computationally expensive for resource-constrained edge computing nodes. To tackle this challenge, this paper presents Sim-LLM, a novel inference optimization mechanism that leverages task similarity to reduce KV cache memory consumption for LLMs. By caching KVs from processed tasks and reusing them for subsequent similar tasks during inference, Sim-LLM significantly reduces memory consumption while boosting system throughput and increasing maximum batch size, all with minimal accuracy degradation. Evaluated on both A40 and A100 GPUs, Sim-LLM achieves a system throughput improvement of up to 39.40% and a memory reduction of up to 34.65%, compared to state-of-the-art approaches. Our source code is available at https://github.com/CGCL-codes/SimLLM.
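The general idea of reusing cached KVs across similar tasks can be illustrated with a toy sketch. This is not the Sim-LLM implementation: the class `SimilarityKVStore`, the cosine-similarity measure, the threshold value, and the `compute_kv` callback are all assumptions introduced here for illustration, and the abstract does not specify how task similarity is measured or at what granularity KVs are shared.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SimilarityKVStore:
    """Hypothetical store mapping task embeddings to previously computed KVs."""

    def __init__(self, threshold: float = 0.9) -> None:
        # Assumed similarity cutoff; a real system would tune this value.
        self.threshold = threshold
        self._entries: List[Tuple[Vector, object]] = []
        self.hits = 0
        self.misses = 0

    def lookup(self, task_embedding: Vector) -> Optional[int]:
        """Return the index of the most similar cached task, if similar enough."""
        best_index, best_score = None, -1.0
        for index, (embedding, _) in enumerate(self._entries):
            score = cosine_similarity(task_embedding, embedding)
            if score > best_score:
                best_index, best_score = index, score
        if best_index is not None and best_score >= self.threshold:
            return best_index
        return None

    def get_or_compute(self, task_embedding: Vector,
                       compute_kv: Callable[[], object]) -> object:
        """Reuse KVs from a similar earlier task, else compute and cache them."""
        index = self.lookup(task_embedding)
        if index is not None:
            self.hits += 1
            return self._entries[index][1]
        self.misses += 1
        kv = compute_kv()  # stands in for the model's actual KV computation
        self._entries.append((task_embedding, kv))
        return kv


store = SimilarityKVStore(threshold=0.9)
kv_first = store.get_or_compute([1.0, 0.0], lambda: "kv-A")     # miss: cached
kv_similar = store.get_or_compute([0.99, 0.05], lambda: "kv-B")  # hit: reuses kv-A
kv_other = store.get_or_compute([0.0, 1.0], lambda: "kv-C")      # miss: dissimilar
```

In this sketch, a hit skips the KV computation and avoids allocating a second copy, which is the intuition behind the memory savings described above; the accuracy impact of reusing approximate KVs is not modeled here.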
