FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving

2 January 2025
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze

Papers citing "FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving"

14 papers

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Y. Chen, J. Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
05 May 2025

GPU Performance Portability needs Autotuning
Burkhard Ringlein, Thomas Parnell, Radu Stoica
30 Apr 2025

Efficient LLM Serving on Hybrid Real-time and Best-effort Requests
Borui Wan, Juntao Zhao, Chenyu Jiang, Chuanxiong Guo, Chuan Wu
Tags: VLM
13 Apr 2025

Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Yueying Li, Jim Dai, Tianyi Peng
10 Apr 2025

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache
Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, M. Yang
Tags: MQ
24 Mar 2025

PERCY: Personal Emotional Robotic Conversational System
Zhijin Meng, Mohammed Althubyani, Shengyuan Xie, Imran Razzak, Eduardo Benitez Sandoval, Mahdi Bamdad, Francisco Cruz
04 Mar 2025

Alchemist: Towards the Design of Efficient Online Continual Learning System
Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang
Tags: CLL, OnRL
03 Mar 2025

Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
17 Feb 2025

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning
C. Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao
04 Feb 2025

Adaptive Self-improvement LLM Agentic System for ML Library Development
Genghan Zhang, Weixin Liang, Olivia Hsu, K. Olukotun
04 Feb 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar
23 Oct 2024

FlashMask: Efficient and Rich Mask Extension of FlashAttention
Guoxia Wang, Jinle Zeng, Xiyuan Xiao, Siming Wu, Jiabin Yang, Lujing Zheng, Zeyu Chen, Jiang Bian, Dianhai Yu, Haifeng Wang
02 Oct 2024

Fiddler: CPU-GPU Orchestration for Fast Inference of Mixture-of-Experts Models
Keisuke Kamahori, Tian Tang, Yile Gu, Kan Zhu, Baris Kasikci
10 Feb 2024