NeurIPS 2024
S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training
Abstract
Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can accelerate matrix multiplications twice as fast as a dense equivalent by implementing 2: 4 sparsity. However, previous STE-based 2: 4 pre-training methods (\eg~STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of discontinuous pruning function. In this study, we comprehensively analyse the bottleneck of traditional N: M sparse training and recognize three drawbacks with discontinuity: incorrect descending direction, inability to predict the amount of descent and sparse mask oscillation. In the light of this statement, we propose S-STE, a simple yet powerful 2: 4 training method that contains two parts: to continuously project weights to be 2: 4 sparse, and to rescale sparse weights with a per-tensor fixed scaling factor. Besides, we adopt minimum-variance unbiased estimation for activation gradient and FP8 quantization for whole process. Results show that our method surpass previous 2: 4 pre-training recipes and is comparable even with full parameter models.
Authors
Keywords
No keywords are indexed for this paper.
Context
- Venue
- Annual Conference on Neural Information Processing Systems
- Archive span
- 1987-2025
- Indexed papers
- 30776
- Paper id
- 506516766535823455