MemeBQ: Memory Efficient Binary Quantization of LLMs

Yuanhui Wang; Kunlong Liu; Minnan Pei; Zhangming Li; Peisong Wang; Qinghao Hu

doi:10.1609/aaai.v40i31.39881

AAAI 2026

Conference Paper AAAI Technical Track on Machine Learning VIII Artificial Intelligence

Abstract

Recent years have witnessed growing scholarly interest in binary post-training quantization (PTQ) techniques for large language models (LLMs). While state-of-the-art (SOTA) binary quantization methods significantly reduce memory footprint and computational demands, they introduce additional memory overhead beyond the binary weight tensors to mitigate performance degradation. Moreover, binary LLMs still suffer from substantial accuracy loss. To address these limitations, we propose MemeBQ, a novel binary PTQ framework for LLMs that reduces the memory overhead of the auxiliary flag bitmaps used in existing binary quantization methods. Specifically, we first design a greedy row clustering method, which leverages the similarity between the row vectors of the weights to partition the weight rows into groups. By sharing a common flag bitmap within each row group, we significantly reduce the memory overhead associated with flag bitmaps. In addition, to improve the performance of binary LLMs, we propose a novel weight splitting method for each row group of weights, which determines the flag bitmap's values in a fine-grained way. Extensive experiments on OPT, Llama-2, and Llama-3 models demonstrate that MemeBQ reduces the extra memory demand by 50% while achieving accuracy comparable to current SOTA methods. Alternatively, with the same number of extra bits, MemeBQ outperforms SOTA binary quantization methods by up to 7% on reasoning benchmarks.
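
The idea of sharing one flag bitmap across a group of similar weight rows can be illustrated with a minimal sketch. This is not the MemeBQ algorithm: the abstract does not specify the similarity measure, the clustering rule, or how flag values are chosen, so everything below is an assumption made for illustration. In particular, the function names (`row_flags`, `greedy_row_clusters`, `shared_bitmaps`, `flag_bits`), the magnitude-based flag, the `min_agreement` threshold, and the majority vote are all hypothetical.

```python
import numpy as np


def row_flags(W):
    """Hypothetical per-row flag: True where |w| exceeds that row's mean magnitude."""
    mag = np.abs(W)
    return mag > mag.mean(axis=1, keepdims=True)


def greedy_row_clusters(flags, min_agreement=0.9):
    """Assumed greedy rule: put each row into the first group whose seed bitmap
    agrees with it on at least `min_agreement` of positions, else open a new group."""
    seeds, groups = [], []
    for i, f in enumerate(flags):
        for g, seed in enumerate(seeds):
            if np.mean(f == seed) >= min_agreement:
                groups[g].append(i)
                break
        else:
            # No existing group is similar enough, so this row seeds a new one.
            seeds.append(f)
            groups.append([i])
    return groups


def shared_bitmaps(flags, groups):
    """One bitmap per group, chosen here by a simple majority vote over its rows."""
    return [flags[idx].mean(axis=0) >= 0.5 for idx in groups]


def flag_bits(n_bitmaps, n_cols):
    """Bits spent on flag bitmaps alone (group-assignment storage is ignored)."""
    return n_bitmaps * n_cols


# Toy weight matrix: rows 0 and 1 share one pattern, rows 2 and 3 another.
W = np.array([
    [4.0, 4.0, 0.1, 0.1],
    [5.0, 5.0, 0.2, 0.2],
    [0.1, 0.1, 4.0, 4.0],
    [0.2, 0.2, 5.0, 5.0],
])
flags = row_flags(W)
groups = greedy_row_clusters(flags)
bitmaps = shared_bitmaps(flags, groups)
per_row_cost = flag_bits(W.shape[0], W.shape[1])
shared_cost = flag_bits(len(bitmaps), W.shape[1])
```

In this toy case the four rows fall into two groups, so the bitmap storage drops from 16 bits to 8. That saving comes from the constructed input and says nothing about the savings reported in the paper.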
