Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
arXiv:2405.10480, 17 May 2024
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan
Papers citing "Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers" (5 of 5 papers shown):
1. ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism. Venmugil Elango. 20 Mar 2025.
2. FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving. Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze. 02 Jan 2025.
3. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar. 23 Oct 2024.
4. FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision. Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao. 11 Jul 2024.
5. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention. Ramya Prabhu, Ajay Nayak, Jayashree Mohan, R. Ramjee, Ashish Panwar. 07 May 2024.