Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

12 February 2025
Konstantin Berestizshevsky
Renzo Andri
Lukas Cavigelli
Abstract

The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-θ, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, so thresholds do not need to be recalibrated for different datasets. Unlike top-k attention, Top-θ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
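
To make the thresholding idea concrete, below is a minimal PyTorch sketch of single-head attention pruned against a scalar threshold. The function name, the placement of the threshold on the pre-softmax scores, and the use of softmax renormalization over the surviving elements as a stand-in for compensation are illustrative assumptions; the paper's calibrated per-layer and per-head thresholds and its actual compensation techniques are described in the full text, not in this abstract.

import torch

def top_theta_attention(q, k, v, theta):
    """Single-head attention with threshold-based pruning (illustrative sketch).

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    theta: scalar threshold on the pre-softmax scores; the paper calibrates
    thresholds per layer/head, which a single scalar only approximates here.
    """
    d = q.shape[-1]
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5  # (n_q, n_k) attention logits

    # Keep elements at or above the threshold; always retain each row's
    # maximum so no query row is left with nothing to attend to.
    keep = (scores >= theta) | (scores == scores.max(dim=-1, keepdim=True).values)
    scores = scores.masked_fill(~keep, float("-inf"))

    # Softmax over the surviving elements; restricting normalization to the
    # retained scores acts as a simple stand-in for compensation here (the
    # paper's actual compensation techniques are not detailed in the abstract).
    probs = torch.softmax(scores, dim=-1)
    return probs @ v  # (n_q, d_v)

# Example usage with random tensors (hypothetical shapes and threshold).
q = torch.randn(4, 64)
k = torch.randn(128, 64)
v = torch.randn(128, 64)
out = top_theta_attention(q, k, v, theta=1.0)

Because each score is compared only against a precomputed threshold, the decision to keep or drop an element needs no full-row information, which is what makes the approach amenable to tiling and avoids a top-k search.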

@article{berestizshevsky2025_2502.08363,
  title={Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding},
  author={Konstantin Berestizshevsky and Renzo Andri and Lukas Cavigelli},
  journal={arXiv preprint arXiv:2502.08363},
  year={2025}
}