HSR-Enhanced Sparse Attention Acceleration

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of applications, but their performance on long-context tasks is often limited by the computational complexity of the attention mechanism. We introduce a novel approach to accelerating attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and in ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to identify the non-zero or ``massively activated'' entries of the attention matrix. We present theoretical analyses for two key scenarios: generation decoding and prompt prefilling. For generation decoding, our approach achieves a running time of $O(mn^{4/5})$, significantly faster than the naive $O(mn)$ approach, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. For prompt prefilling, we reduce the running time from $O(mn)$ to $O(mn^{1 - 1/\lfloor d/2 \rfloor} + mn^{4/5})$. Our method introduces only provably negligible error for Softmax attention. This work represents a significant step towards enabling efficient long-context processing in LLMs.
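To make the core idea concrete, here is a minimal sketch, not the paper's implementation: a half-space query reports the keys whose inner product with the query exceeds a threshold, and the ReLU attention output is then computed only over those reported entries. The `half_space_report` helper below is a hypothetical brute-force stand-in; a true HSR data structure would answer the same geometric query in sublinear time.

```python
import numpy as np

def half_space_report(keys, query, tau):
    # Brute-force stand-in for an HSR data structure: report the indices i
    # with <keys[i], query> > tau, i.e. the keys lying in the open half-space.
    # (An actual HSR structure answers this query without scanning all keys.)
    return np.nonzero(keys @ query > tau)[0]

def sparse_relu_attention(query, keys, values, tau=0.0):
    # ReLU attention restricted to the reported ("activated") keys.
    # With tau = 0 this is exact: entries with <q, k> <= 0 are zeroed by
    # the ReLU anyway, so skipping them changes nothing.
    idx = half_space_report(keys, query, tau)
    if idx.size == 0:
        return np.zeros(values.shape[1])
    scores = np.maximum(keys[idx] @ query, 0.0)  # ReLU activation
    return scores @ values[idx] / scores.sum()   # normalized weighted sum
```

The design point the sketch illustrates is that, because most attention scores are zero (ReLU) or negligible (Softmax), reporting only the half-space of activated keys lets the downstream matrix arithmetic touch far fewer than all $n$ entries.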
@article{chen2025_2410.10165,
  title={HSR-Enhanced Sparse Attention Acceleration},
  author={Bo Chen and Yingyu Liang and Zhizhou Sha and Zhenmei Shi and Zhao Song},
  journal={arXiv preprint arXiv:2410.10165},
  year={2025}
}