
JMLR 2021

CAT: Compression-Aware Training for bandwidth reduction

Journal Article · Artificial Intelligence · Machine Learning

Abstract

One major obstacle to the ubiquitous use of CNNs for inference is their relatively high memory bandwidth requirement, which can be the primary energy consumer and throughput bottleneck in hardware accelerators. Inspired by quantization-aware training approaches, we propose compression-aware training (CAT), a method that trains the model so that its weights and feature maps compress well when the network is deployed. Specifically, CAT trains the model to produce low-entropy feature maps, enabling efficient compression at inference time with classical transform-coding methods. CAT significantly improves state-of-the-art quantization results on a range of vision and NLP tasks, including image classification (ImageNet), object detection (Pascal VOC), sentiment analysis (CoLA), and textual entailment (MNLI). For example, on ResNet-18 with 5-bit quantization, we achieve near-baseline ImageNet accuracy with an average representation of only 1.5 bits per value. Moreover, we show that entropy reduction of weights and activations can be applied together, further improving bandwidth reduction. A reference implementation is available.
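The core idea, training the network to produce low-entropy feature maps so that an entropy coder can compress them cheaply at inference time, can be sketched with a small regularizer. This is a minimal illustration, not the paper's estimator: the soft-histogram entropy proxy, the function names, and the `lam` trade-off weight below are all assumptions for the sake of example.

```python
import torch

def soft_entropy(x, num_bins=32, temperature=0.1):
    """Differentiable estimate of the entropy of a tensor's values.

    A soft histogram gives each value fractional membership in
    `num_bins` bins; the entropy of the resulting distribution is a
    differentiable proxy for the coding cost under an entropy coder.
    (Illustrative construction, not taken from the paper.)
    """
    x = x.flatten()
    lo, hi = x.min().detach(), x.max().detach()
    centers = torch.linspace(lo.item(), hi.item(), num_bins)
    # Soft assignment of each value to each bin center (N x num_bins).
    weights = torch.softmax(-((x[:, None] - centers[None, :]) ** 2) / temperature, dim=1)
    probs = weights.mean(dim=0) + 1e-9  # soft histogram, normalized
    return -(probs * probs.log2()).sum()

def cat_loss(task_loss, activations, lam=0.01):
    """Task loss plus an entropy penalty on the given feature maps.

    `lam` trades task accuracy against bandwidth (hypothetical value).
    """
    return task_loss + lam * sum(soft_entropy(a) for a in activations)
```

During training one would collect intermediate feature maps (e.g. via forward hooks) and pass them to `cat_loss`; because the entropy estimate is differentiable, gradient descent pushes the activation distribution toward fewer effective bins, i.e. fewer bits per value after entropy coding.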

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
Journal of Machine Learning Research
Archive span
2000-2026
Indexed papers
4180
Paper id
176699960271881672