arXiv:2505.21487
Hardware-Efficient Attention for Fast Decoding
27 May 2025
Ted Zadouri, Hubert Strauss, Tri Dao
Papers citing
"Hardware-Efficient Attention for Fast Decoding"
9 papers
SeerAttention-R: Sparse Attention Adaptation for Long Reasoning
Yizhao Gao, Shuming Guo, Shijie Cao, Yuqing Xia, Yu Cheng, ..., Hayden Kwok-Hay So, Yu Hua, Ting Cao, Fan Yang, Mao Yang
10 Jun 2025

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
11 Mar 2025

Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, Xuelong Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
09 Mar 2025

Slim attention: cut your context memory in half without loss -- K-cache is all you need for MHA
Nils Graef, Matthew Clapp
07 Mar 2025

Towards Economical Inference: Enabling DeepSeek's Multi-Head Latent Attention in Any Transformer-based LLMs
Tao Ji, B. Guo, Y. Wu, Qipeng Guo, Lixing Shen, Zhan Chen, Xipeng Qiu, Qi Zhang, Tao Gui
21 Feb 2025

Rope to Nope and Back Again: A New Hybrid Attention Strategy
Bowen Yang, Bharat Venkitesh, Dwarak Talupuru, Hangyu Lin, David Cairuz, Phil Blunsom, Acyr Locatelli
30 Jan 2025

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI: Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, ..., Shiyu Wang, S. Yu, Shunfeng Zhou, Shuting Pan, S.S. Li
22 Jan 2025

Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao
11 Jan 2025

Round and Round We Go! What makes Rotary Positional Encodings useful?
Federico Barbero, Alex Vitvitskyi, Christos Perivolaropoulos, Razvan Pascanu, Petar Velickovic
08 Oct 2024