Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

17 May 2024
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan

Papers citing "Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers" (5 of 5 papers shown)

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism
Venmugil Elango
20 Mar 2025

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
02 Jan 2025

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar
23 Oct 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
11 Jul 2024

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
Ramya Prabhu, Ajay Nayak, Jayashree Mohan, R. Ramjee, Ashish Panwar
07 May 2024