
NeurIPS 2025

TSENOR: Highly-Efficient Algorithm for Finding Transposable N:M Sparse Masks

Conference Paper · Main Conference Track · Artificial Intelligence · Machine Learning

Abstract

Network pruning reduces the computational requirements of large neural networks, with N:M sparsity—retaining only N out of every M consecutive weights—offering a compelling balance between compressed model quality and hardware acceleration. However, N:M sparsity only accelerates forward-pass computations, as N:M patterns are not preserved under matrix transposition, limiting efficiency during training, where both passes are computationally intensive. While transposable N:M sparsity has been proposed to address this limitation, existing methods for finding transposable N:M sparse masks either fail to scale to large models or are restricted to M=4, which results in a suboptimal compression-accuracy trade-off. We introduce an efficient solver for transposable N:M masks that scales to billion-parameter models. We formulate mask generation as a set of optimal transport problems and solve them through entropy regularization and Dykstra's algorithm, followed by a rounding procedure. Our tensor-based implementation exploits GPU parallelism, achieving up to 100× speedup with only 1-10% error compared to existing methods. Our approach can be integrated with layer-wise N:M pruning frameworks, including Wanda, SparseGPT, and ALPS, to produce transposable N:M sparse models with arbitrary N:M values. Experiments show that LLaMA3.2-8B with transposable 16:32 sparsity maintains performance close to its standard N:M counterpart and outperforms the standard 2:4 sparse model, showing the practical value of our approach. Our code is available at https://github.com/mazumder-lab/TSENOR.
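To make the constraint concrete: a transposable N:M mask on an M×M weight block keeps at most N entries in every row *and* every column, so the pattern survives transposition. The sketch below follows the spirit of the abstract's recipe (an entropy-regularized transport relaxation followed by rounding), but it is a simplified single-block illustration using plain Sinkhorn scaling and greedy rounding, not the paper's Dykstra-based TSENOR solver; the function name and parameters are invented for illustration.

```python
import numpy as np

def transposable_nm_mask(block, N, eps=0.1, iters=200):
    """Sketch: transposable N:M mask for one M x M weight block,
    keeping at most N entries per row AND per column.
    Simplified illustration, not the paper's implementation."""
    M = block.shape[0]
    # Entropic relaxation: build a kernel biased toward large |weights|
    # and rescale so every row sum and column sum equals N
    # (Sinkhorn iterations on a scaled transport polytope).
    K = np.exp(np.abs(block) / eps)
    u = np.ones(M)
    v = np.ones(M)
    for _ in range(iters):
        u = N / (K @ v)
        v = N / (K.T @ u)
    P = u[:, None] * K * v[None, :]  # soft (fractional) mask
    # Rounding: greedily accept the largest plan entries that still
    # respect the N-per-row / N-per-column budgets.
    mask = np.zeros((M, M), dtype=bool)
    row_cnt = np.zeros(M, dtype=int)
    col_cnt = np.zeros(M, dtype=int)
    for idx in np.argsort(-P, axis=None):
        i, j = divmod(idx, M)
        if row_cnt[i] < N and col_cnt[j] < N:
            mask[i, j] = True
            row_cnt[i] += 1
            col_cnt[j] += 1
    return mask
```

Because the mask satisfies the budget along both axes, `mask` and `mask.T` are both valid N:M patterns, which is what allows acceleration of both forward and backward matrix multiplications during training.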


Context

Venue
Annual Conference on Neural Information Processing Systems