
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization

Abstract

Although quantization for linear layers has been widely used, its application to accelerating the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose quantizing the matrices $(Q, K)$ to INT4 at a hardware-friendly thread-level granularity and quantizing the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of the INT4 $QK^\top$. Third, we propose a two-level accumulation strategy to enhance the accuracy of the FP8 $\widetilde P V$. The operations per second (OPS) of SageAttention2 surpass those of FlashAttention2 and xformers by about 3x and 4.5x on RTX 4090, respectively. Moreover, SageAttention2 matches the speed of FlashAttention3 (FP8) on Hopper GPUs while delivering much higher accuracy. Comprehensive experiments confirm that our approach incurs negligible end-to-end metric loss across diverse models, including those for language, image, and video generation. The code is available at this https URL.
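To make the abstract's first two ideas concrete, below is a minimal, hypothetical PyTorch sketch of smoothing followed by low-bit quantization before the $QK^\top$ matmul. It is not the authors' kernel: the per-block granularity, the function name smooth_and_quantize_int4, the block size, and the INT4 simulation via int8 storage are all illustrative assumptions (the paper's per-thread layout is hardware-specific, and a real kernel packs two 4-bit values per byte and runs the matmul on tensor cores).

```python
# Hypothetical sketch (not the paper's implementation): smooth K and Q,
# then quantize Q per block to a simulated INT4 range before QK^T.
import torch

def smooth_and_quantize_int4(x, block_size=64):
    """Subtract the per-block token mean, then symmetrically quantize
    each block to the INT4 range [-7, 7].

    Returns (codes, scales, block_means) so the caller can
    (a) dequantize with codes * scale and
    (b) add the mean's contribution to QK^T back in higher precision.
    """
    n, d = x.shape
    x = x.reshape(n // block_size, block_size, d)
    means = x.mean(dim=1, keepdim=True)                 # outlier-smoothing step
    centered = x - means
    scales = centered.abs().amax(dim=(1, 2), keepdim=True) / 7.0
    codes = torch.clamp(torch.round(centered / scales), -7, 7).to(torch.int8)
    return codes, scales, means.squeeze(1)

# Toy usage: approximate S = Q K^T with the quantized Q plus a mean correction.
torch.manual_seed(0)
q = torch.randn(128, 64)
k = torch.randn(128, 64)
k = k - k.mean(dim=0, keepdim=True)                     # smooth K (channel mean)
q_codes, q_scale, q_mean = smooth_and_quantize_int4(q)
q_deq = (q_codes.float() * q_scale).reshape(128, 64)    # simulated dequantization
s_approx = q_deq @ k.T + torch.repeat_interleave(q_mean @ k.T, 64, dim=0)
print("max abs error vs. FP32 QK^T:", (s_approx - q @ k.T).abs().max().item())
```

In this sketch the mean correction term is computed in full precision and added after the (simulated) low-bit matmul, mirroring the abstract's point that smoothing recovers accuracy lost to aggressive 4-bit quantization; the FP8 $\widetilde P V$ path and its two-level accumulation are not modeled here.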

@article{zhang2025_2411.10958,
  title={SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization},
  author={Jintao Zhang and Haofeng Huang and Pengle Zhang and Jia Wei and Jun Zhu and Jianfei Chen},
  journal={arXiv preprint arXiv:2411.10958},
  year={2025}
}