Arrow Research search

Author name cluster

Marcin Chochowski

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers (3)

NeurIPS 2025 Conference Paper

Efficient Hybrid Language Model Compression through Group-Aware SSM Pruning

  • Ali Taghibakhshi
  • Sharath Turuvekere Sreenivas
  • Saurav Muralidharan
  • Marcin Chochowski
  • Yashaswi Karnati
  • Raviraj Joshi
  • Ameya Mahabaleshwarkar
  • Zijia Chen

Hybrid language models that combine Attention and State Space Models (SSMs) have been shown to achieve state-of-the-art accuracy and runtime performance. Recent work has also demonstrated that applying pruning and distillation to Attention-only models yields smaller, more accurate models at a fraction of the training cost. In this work, we explore the effectiveness of compressing Hybrid architectures. To this end, we introduce a novel group-aware pruning method for Mamba layers that preserves the structural integrity of SSM blocks and their sequence modeling capabilities. We combine this method with FFN, embedding dimension, and layer pruning, along with knowledge distillation-based retraining to obtain a unified compression recipe for hybrid models. Using this recipe, we compress the Nemotron-H 8B Hybrid model down to 4B parameters with up to $40\times$ fewer training tokens compared to similarly-sized models. The resulting model surpasses the accuracy of similarly-sized models while achieving $\sim2\times$ faster inference throughput, significantly advancing the Pareto frontier.
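The group-aware pruning idea can be made concrete with a minimal sketch: score SSM channels group by group and drop whole groups, so the pruned layer never splits a state group. The layer layout, the mean-L2 importance score, and the function names below are illustrative assumptions for this sketch, not the paper's actual criterion.

```python
import torch

def group_aware_importance(ssm_weight: torch.Tensor, n_groups: int) -> torch.Tensor:
    """Score whole SSM groups so pruning never splits a group.

    ssm_weight: (d_inner, d_model) projection weight of a hypothetical SSM block.
    n_groups:   number of state groups the d_inner channels are divided into.
    Returns one importance score per group (here: mean L2 norm of its channels).
    """
    d_inner = ssm_weight.shape[0]
    assert d_inner % n_groups == 0, "channels must divide evenly into groups"
    group_size = d_inner // n_groups
    channel_norms = ssm_weight.norm(dim=1)                  # one score per channel
    return channel_norms.view(n_groups, group_size).mean(dim=1)

def prune_groups(ssm_weight: torch.Tensor, n_groups: int, keep_groups: int) -> torch.Tensor:
    """Keep the `keep_groups` highest-scoring groups and drop the rest wholesale."""
    scores = group_aware_importance(ssm_weight, n_groups)
    keep = scores.topk(keep_groups).indices.sort().values   # preserve original group order
    group_size = ssm_weight.shape[0] // n_groups
    rows = torch.cat([torch.arange(g * group_size, (g + 1) * group_size)
                      for g in keep.tolist()])
    return ssm_weight[rows]
```

Because whole groups are kept or dropped together, the pruned projection remains compatible with the block's grouped state recurrence, which is the structural property the abstract emphasizes.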

NeurIPS 2024 Conference Paper

Compact Language Models via Pruning and Knowledge Distillation

  • Saurav Muralidharan
  • Sharath Turuvekere Sreenivas
  • Raviraj Joshi
  • Marcin Chochowski
  • Mostofa Patwary
  • Mohammad Shoeybi
  • Bryan Catanzaro
  • Jan Kautz

Large language models (LLMs) targeting different deployment scales and sizes are currently produced by training each variant from scratch; this is extremely compute-intensive. In this paper, we investigate if pruning an existing LLM and then re-training it with a small fraction (<3%) of the original training data can be a suitable alternative to repeated, full retraining. To this end, we develop a set of practical and effective compression best practices for LLMs that combine depth, width, attention and MLP pruning with knowledge distillation-based retraining; we arrive at these best practices through a detailed empirical exploration of pruning strategies for each axis, methods to combine axes, distillation strategies, and search techniques for arriving at optimal compressed architectures. We use this guide to compress the Nemotron-4 family of LLMs by a factor of 2-4x, and compare their performance to similarly-sized models on a variety of language modeling tasks. On these tasks, we perform better than Nemotron-3 8B and LLaMa2 7B using up to 40x fewer training tokens, on par with Mistral 7B and Gemma 7B using up to 85x fewer tokens, and slightly worse than LLaMa3 8B using up to 159x fewer tokens. Our models also compare favorably to state-of-the-art compression techniques from the literature.
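As a sketch of the distillation-based retraining the abstract mentions, the loss below is the standard logit-distillation objective (KL divergence between teacher and student token distributions). The exact losses, temperature, and training loop in the paper may differ; `student`, `teacher`, `batch`, and `optimizer` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature: float = 1.0):
    """KL divergence between teacher and student token distributions,
    the usual objective when retraining a pruned student against a frozen teacher."""
    s = F.log_softmax(student_logits / temperature, dim=-1)
    t = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * temperature ** 2

def retrain_step(student, teacher, batch, optimizer):
    """One retraining step: the frozen teacher supervises the pruned student."""
    with torch.no_grad():
        teacher_logits = teacher(batch["input_ids"]).logits
    student_logits = student(batch["input_ids"]).logits
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Because the supervision signal comes from the teacher's full output distribution rather than only ground-truth tokens, the pruned model can recover accuracy from far less data than full pretraining, which is the premise of the "<3% of the original training data" claim.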

ICML 2024 Conference Paper

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

  • Piotr Nawrot
  • Adrian Lancucki
  • Marcin Chochowski
  • David Tarjan
  • Edoardo M. Ponti

Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key–value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key–value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to $\sim3.7\times$ throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4$\times$ cache compression, outperforming up-trained grouped-query attention (GQA) and key–value eviction policies (H$_2$O, TOVA). GQA and DMC can even be combined to obtain compounded gains. As a result, DMC fits longer contexts and larger batches within any given memory budget. We release the DMC code and models at https://github.com/NVIDIA/Megatron-LM/tree/DMC.
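To make the cache-compression mechanic concrete, here is a minimal, assumption-laden sketch of a DMC-style update for a single head: at each decoding step the new key/value pair is either appended as a fresh cache slot or folded into the last slot. In the actual method the append-vs-merge decision is learned per head and layer and the accumulation is weighted; the fixed averaging rule and the externally supplied `append` flag below are simplifications for illustration.

```python
import torch

def compress_kv_step(k_cache, v_cache, k_new, v_new, append: bool):
    """One decoding step of a DMC-style cache update (illustrative only).

    If `append` is True the new key/value pair gets a fresh slot; otherwise it
    is folded into the last slot by averaging, so the cache grows sub-linearly
    with the number of generated tokens.
    Shapes: k_cache/v_cache are (slots, head_dim); k_new/v_new are (head_dim,).
    """
    if append or k_cache.numel() == 0:
        k_cache = torch.cat([k_cache, k_new[None]], dim=0)
        v_cache = torch.cat([v_cache, v_new[None]], dim=0)
    else:
        k_cache[-1] = 0.5 * (k_cache[-1] + k_new)
        v_cache[-1] = 0.5 * (v_cache[-1] + v_new)
    return k_cache, v_cache
```

When merges fire on a sizeable fraction of steps, the cache holds far fewer slots than tokens generated, which is where the memory and throughput savings reported in the abstract come from.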