Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

12 February 2025
Konstantin Berestizshevsky
Renzo Andri
Lukas Cavigelli
Abstract

The attention mechanism is essential for the impressive capabilities of transformer-based Large Language Models (LLMs). However, calculating attention is computationally intensive due to its quadratic dependency on the sequence length. We introduce a novel approach called Top-Theta Attention, or simply Top-θ, which selectively prunes less essential attention elements by comparing them against carefully calibrated thresholds. This method greatly improves the efficiency of self-attention matrix multiplication while preserving model accuracy, reducing the number of required V cache rows by 3x during generative decoding and the number of attention elements by 10x during the prefill phase. Our method does not require model retraining; instead, it requires only a brief calibration phase to be resilient to distribution shifts, so thresholds do not need to be recalibrated for different datasets. Unlike top-k attention, Top-θ eliminates full-vector dependency, making it suitable for tiling and scale-out and avoiding costly top-k search. A key innovation of our approach is the development of efficient numerical compensation techniques, which help preserve model accuracy even under aggressive pruning of attention scores.
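
To make the thresholding idea concrete, below is a minimal PyTorch sketch of single-head attention pruned against a scalar threshold. The function name, the placement of the threshold on the pre-softmax scores, and the use of softmax renormalization over the surviving elements as a stand-in for compensation are illustrative assumptions; the paper's calibrated per-layer and per-head thresholds and its actual compensation techniques are described in the full text, not in this abstract.

import torch

def top_theta_attention(q, k, v, theta):
    """Single-head attention with threshold-based pruning (illustrative sketch).

    q: (n_q, d) queries, k: (n_k, d) keys, v: (n_k, d_v) values.
    theta: scalar threshold on the pre-softmax scores; the paper calibrates
    thresholds per layer/head, which a single scalar only approximates here.
    """
    d = q.shape[-1]
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5  # (n_q, n_k) attention logits

    # Keep elements at or above the threshold; always retain each row's
    # maximum so no query row is left with nothing to attend to.
    keep = (scores >= theta) | (scores == scores.max(dim=-1, keepdim=True).values)
    scores = scores.masked_fill(~keep, float("-inf"))

    # Softmax over the surviving elements; restricting normalization to the
    # retained scores acts as a simple stand-in for compensation here (the
    # paper's actual compensation techniques are not detailed in the abstract).
    probs = torch.softmax(scores, dim=-1)
    return probs @ v  # (n_q, d_v)

# Example usage with random tensors (hypothetical shapes and threshold).
q = torch.randn(4, 64)
k = torch.randn(128, 64)
v = torch.randn(128, 64)
out = top_theta_attention(q, k, v, theta=1.0)

Because each score is compared only against a precomputed threshold, the decision to keep or drop an element needs no full-row information, which is what makes the approach amenable to tiling and avoids a top-k search.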

@article{berestizshevsky2025_2502.08363,
  title={Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding},
  author={Konstantin Berestizshevsky and Renzo Andri and Lukas Cavigelli},
  journal={arXiv preprint arXiv:2502.08363},
  year={2025}
}