Accelerating Transformer Inference and Training with 2:4 Activation Sparsity

20 March 2025
Daniel Haziza
Timothy Chou
Dhruv Choudhary
Luca Wehrstedt
Francisco Massa
Jiecao Yu
Geonhwa Jeong
Supriya Rao
Patrick Labatut
Jesse Cai
Abstract

In this paper, we demonstrate how to apply 2:4 sparsity, a popular hardware-accelerated GPU sparsity pattern, to activations in order to accelerate large language model training and inference. Crucially, we exploit the intrinsic sparsity of Squared-ReLU activations to provide this acceleration with no accuracy loss. Our approach achieves Feed-Forward Network (FFN) speedups of up to 1.3x in both the forward and backward passes. This work highlights the potential for sparsity to play a key role in accelerating large language model training and inference.
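The core idea in the abstract is that Squared-ReLU activations are already mostly zero, so enforcing the hardware-friendly 2:4 pattern (at most 2 nonzeros in every group of 4 values) discards little or no information. The snippet below is a minimal PyTorch sketch of that idea only; it is not the authors' fused kernels, and the function names (squared_relu, prune_2_to_4) and toy dimensions are illustrative assumptions. A real speedup would additionally require a 2:4 sparse matmul kernel on supporting hardware; here a dense matmul stands in for it.

import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    # Squared-ReLU: ReLU followed by squaring, which yields many exact zeros.
    return torch.relu(x) ** 2

def prune_2_to_4(x: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude values in every contiguous group of 4."""
    orig_shape = x.shape
    groups = x.reshape(-1, 4)                      # (num_groups, 4)
    idx = groups.abs().topk(2, dim=-1).indices     # positions of the 2 kept values
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, idx, True)
    return (groups * mask).reshape(orig_shape)

# Toy FFN forward pass (hypothetical sizes): the pruned activation is what a
# 2:4 sparse matmul kernel would consume on hardware with sparse tensor cores.
d_model, d_ff, batch = 16, 64, 2
w1 = torch.randn(d_model, d_ff)
w2 = torch.randn(d_ff, d_model)
x = torch.randn(batch, d_model)

act = squared_relu(x @ w1)      # intrinsically sparse activations
act_24 = prune_2_to_4(act)      # enforce the 2:4 pattern
out = act_24 @ w2               # dense matmul stands in for the sparse kernel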

@article{haziza2025_2503.16672,
  title={Accelerating Transformer Inference and Training with 2:4 Activation Sparsity},
  author={Daniel Haziza and Timothy Chou and Dhruv Choudhary and Luca Wehrstedt and Francisco Massa and Jiecao Yu and Geonhwa Jeong and Supriya Rao and Patrick Labatut and Jesse Cai},
  journal={arXiv preprint arXiv:2503.16672},
  year={2025}
}