FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
arXiv:2501.01005 · 2 January 2025
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
Papers citing "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving" (14 papers):

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Y. Chen, J. Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
05 May 2025

GPU Performance Portability needs Autotuning
Burkhard Ringlein, Thomas Parnell, Radu Stoica
30 Apr 2025

Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
Borui Wan, Juntao Zhao, Chenyu Jiang, Chuanxiong Guo, Chuan Wu
13 Apr 2025

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Yueying Li, Jim Dai, Tianyi Peng
10 Apr 2025

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache
Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, M. Yang
24 Mar 2025

PERCY: Personal Emotional Robotic Conversational System
Zhijin Meng, Mohammed Althubyani, Shengyuan Xie, Imran Razzak, Eduardo Benitez Sandoval, Mahdi Bamdad, Francisco Cruz
04 Mar 2025

Alchemist: Towards the Design of Efficient Online Continual Learning System
Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang
03 Mar 2025

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
17 Feb 2025

Twilight: Adaptive Attention Sparsity with Hierarchical Top-p Pruning
C. Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
04 Feb 2025

Adaptive Self-improvement LLM Agentic System for ML Library Development
Genghan Zhang, Weixin Liang, Olivia Hsu, K. Olukotun
04 Feb 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar
23 Oct 2024

FlashMask: Efficient and Rich Mask Extension of FlashAttention
Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, Haifeng Wang
02 Oct 2024

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci
10 Feb 2024