SeerAttention-R: Sparse Attention Adaptation for Long Reasoning

10 June 2025

Abstract

We introduce SeerAttention-R, a sparse attention framework tailored to the long decoding phase of reasoning models. Extending SeerAttention, SeerAttention-R retains the design of learning attention sparsity through a self-distilled gating mechanism, while removing query pooling to accommodate auto-regressive decoding. With its lightweight plug-in gating mechanism, SeerAttention-R is flexible and can be easily integrated into existing pretrained models without modifying their original parameters. We demonstrate that SeerAttention-R, trained on just 0.4B tokens, maintains near-lossless reasoning accuracy with a 4K token budget on the AIME benchmark under large sparse attention block sizes (64/128). Using TileLang, we develop a highly optimized sparse decoding kernel that achieves near-theoretical speedups of up to 9x over FlashAttention-3 on an H100 GPU at 90% sparsity. Code is available at: this https URL.
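
To make the idea of block-sparse decoding under a token budget concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the function names (`select_blocks`, `sparse_decode_attention`, `ideal_speedup`) are hypothetical, and the block scorer is an assumed stand-in (mean-pooled keys scored by a dot product with the unpooled query) for the learned, self-distilled gate. The last helper assumes attention cost scales with the fraction of key/value blocks read, under which 90% sparsity bounds the speedup at roughly 10x, consistent with the reported 9x being "near-theoretical".

```python
import math


def select_blocks(q, keys, block_size, token_budget):
    """Choose which key blocks one decoding query attends to.

    Assumption: each block is summarized by mean-pooling its keys and scored
    by a dot product with the (unpooled) query. This only mimics the role of
    the learned gate described in the abstract.
    """
    n_blocks = math.ceil(len(keys) / block_size)
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        pooled = [sum(col) / len(block) for col in zip(*block)]
        scores.append(sum(qi * ki for qi, ki in zip(q, pooled)))
    # The token budget translates into a number of whole blocks to keep.
    k = max(1, min(n_blocks, token_budget // block_size))
    top = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)[:k]
    return sorted(top)


def sparse_decode_attention(q, keys, values, block_size, token_budget):
    """Softmax attention for one query, restricted to the selected blocks."""
    blocks = select_blocks(q, keys, block_size, token_budget)
    idx = [i for b in blocks
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(qi * ki for qi, ki in zip(q, keys[i])) for i in idx]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    dim = len(values[0])
    out = [sum(w * values[i][d] for w, i in zip(weights, idx)) / z
           for d in range(dim)]
    return out, blocks


def ideal_speedup(sparsity):
    """Upper bound on speedup if cost is proportional to blocks read."""
    return 1.0 / (1.0 - sparsity)
```

With a block size of 64 and a 4K token budget, this sketch would keep 4096 // 64 = 64 blocks per query, regardless of how long the generated sequence grows.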


@article{gao2025_2506.08889,
  title={SeerAttention-R: Sparse Attention Adaptation for Long Reasoning},
  author={Yizhao Gao and Shuming Guo and Shijie Cao and Yuqing Xia and Yu Cheng and Lei Wang and Lingxiao Ma and Yutao Sun and Tianzhu Ye and Li Dong and Hayden Kwok-Hay So and Yu Hua and Ting Cao and Fan Yang and Mao Yang},
  journal={arXiv preprint arXiv:2506.08889},
  year={2025}
}

Main: 12 Pages

10 Figures

Bibliography: 6 Pages

2 Tables
