TIST 2026 Journal Article
BUZZ: Beehive-structured Sparse KV Cache with Segmented Heavy Hitters for Efficient LLM Inference
- Junqi Zhao
- Zhijin Fang
- Shu Li
- Shaohui Yang
- Shichao He
Large language models (LLMs) are critical to natural language processing but face challenges in inference speed and computational efficiency, hindering real-time applications. The key-value (KV) cache mechanism reduces computational overhead in transformer models; however, efficient contextual understanding remains problematic. In this paper, we introduce BUZZ, an innovative KV caching algorithm that leverages structured contextual information to optimize cache memory usage while enhancing inference speed. The core idea of BUZZ is interval sampling of historically significant tokens to preserve sentence-structure information, ensuring that the historical tokens retained in the KV cache remain distributed at nearly equal intervals. Tokens recently evicted from the sliding window undergo local-max sampling based on their attention values, preserving crucial contextual information. Additionally, we propose BUZZ with \(\log n\), an extension that improves performance under extreme compression and long-context settings. Evaluations on five real-world datasets (CNN/Daily Mail, XSUM, LongBench, Wikitext, and 10-QA) demonstrate that BUZZ (1) achieves a 2.5\(\times\) reduction in cache memory usage for LLM inference while maintaining over 99% accuracy in long-text summarization, and (2) surpasses the state of the art in multi-document question answering by 7.69% under equivalent memory constraints, avoiding the out-of-memory issues faced by full-cache approaches. Furthermore, BUZZ achieves substantial inference speed improvements with \(O(\log n)\) time complexity. The implementation of BUZZ is available at: https://github.com/JunqiZhao888/buzz-llm.
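The cache-selection mechanism summarized in the abstract (keeping attention-sink tokens, a sliding window of recent tokens, and local-max "heavy hitter" tokens sampled per fixed-size segment so that retained history stays spread at near-equal intervals) can be sketched as follows. This is a minimal illustrative sketch under simplified assumptions, not the released implementation: the function name, segment and window sizes, and the per-token accumulated-attention array are all hypothetical.

```python
# Illustrative BUZZ-style KV cache token selection (assumption: per-token
# accumulated attention scores are available as a 1-D array).
import numpy as np

def buzz_keep_indices(attn_scores, n_sink=4, window=128, segment=16):
    """Return indices of historical tokens to keep in the KV cache.

    attn_scores : accumulated attention each historical token has received.
    n_sink      : always keep the first few "sink" tokens.
    window      : always keep the most recent tokens (sliding window).
    segment     : middle tokens are split into equal-sized cells; the locally
                  strongest token in each cell is kept, so retained heavy
                  hitters stay distributed at nearly equal intervals.
    """
    n = len(attn_scores)
    keep = set(range(min(n_sink, n)))              # attention sinks
    keep |= set(range(max(n - window, 0), n))      # recent sliding window

    # Middle region: local-max sampling within each fixed-size segment.
    for start in range(n_sink, max(n - window, n_sink), segment):
        end = min(start + segment, n - window)
        if end > start:
            local = start + int(np.argmax(attn_scores[start:end]))
            keep.add(local)
    return sorted(keep)

# Toy usage: 512 historical tokens with random accumulated attention.
scores = np.random.rand(512)
kept = buzz_keep_indices(scores)
print(f"kept {len(kept)} of {len(scores)} tokens")
```

Under these example settings the cache shrinks to roughly the sink tokens, the recent window, and one heavy hitter per segment, which is the kind of structured sparsity the paper attributes its memory savings to; the exact segmentation and scoring policy in BUZZ should be taken from the linked repository.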