HSR-Enhanced Sparse Attention Acceleration

Large Language Models (LLMs) have demonstrated remarkable capabilities across a wide range of applications, but their performance on long-context tasks is often limited by the computational complexity of the attention mechanism. We introduce a novel approach to accelerating attention computation in LLMs, particularly for long-context scenarios. We leverage the inherent sparsity within attention mechanisms, both in conventional Softmax attention and in ReLU attention (with $\mathsf{ReLU}^\alpha$ activation, $\alpha \in \mathbb{N}_+$), to significantly reduce the running time complexity. Our method employs a Half-Space Reporting (HSR) data structure to identify the non-zero or ``massively activated'' entries of the attention matrix. We present theoretical analyses for two key scenarios: generation decoding and prompt prefilling. For generation decoding, our approach achieves a running time of $O(mn^{4/5})$, significantly faster than the naive $O(mn)$ approach, where $n$ is the context length, $m$ is the query length, and $d$ is the hidden dimension. For prompt prefilling, we reduce the running time from $O(mn)$ to $O(mn^{1 - 1/\lfloor d/2 \rfloor} + mn^{4/5})$. Our method introduces only provably negligible error for Softmax attention. This work represents a significant step towards enabling efficient long-context processing in LLMs.
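To make the core idea concrete, here is a minimal sketch, not the paper's implementation: a half-space query reports the keys whose inner product with the query exceeds a threshold, and the ReLU attention output is then computed only over those reported entries. The `half_space_report` helper below is a hypothetical brute-force stand-in; a true HSR data structure would answer the same geometric query in sublinear time.

```python
import numpy as np

def half_space_report(keys, query, tau):
    # Brute-force stand-in for an HSR data structure: report the indices i
    # with <keys[i], query> > tau, i.e. the keys lying in the open half-space.
    # (An actual HSR structure answers this query without scanning all keys.)
    return np.nonzero(keys @ query > tau)[0]

def sparse_relu_attention(query, keys, values, tau=0.0):
    # ReLU attention restricted to the reported ("activated") keys.
    # With tau = 0 this is exact: entries with <q, k> <= 0 are zeroed by
    # the ReLU anyway, so skipping them changes nothing.
    idx = half_space_report(keys, query, tau)
    if idx.size == 0:
        return np.zeros(values.shape[1])
    scores = np.maximum(keys[idx] @ query, 0.0)  # ReLU activation
    return scores @ values[idx] / scores.sum()   # normalized weighted sum
```

The design point the sketch illustrates is that, because most attention scores are zero (ReLU) or negligible (Softmax), reporting only the half-space of activated keys lets the downstream matrix arithmetic touch far fewer than all $n$ entries.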
@article{chen2025_2410.10165,
  title={HSR-Enhanced Sparse Attention Acceleration},
  author={Bo Chen and Yingyu Liang and Zhizhou Sha and Zhenmei Shi and Zhao Song},
  journal={arXiv preprint arXiv:2410.10165},
  year={2025}
}