arXiv:2403.11421 · Cited By
FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
Jiaao He, Jidong Zhai (18 March 2024)
Links: arXiv (abs) · PDF · HTML · GitHub (13,250★)
Papers citing "FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines" (27 of 27 papers shown)
Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System
Shubham Agarwal, Subrata Mitra, Saud Iqbal (10 Nov 2025) · DiffM, VLM

FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern
Ao Shen, Rui Zhang, Junping Zhao (30 Aug 2025)

FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning
Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan (29 Aug 2025) · LRM

TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin (24 Aug 2025)

HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao (26 Jul 2025) · MQ

MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, Jianzong Wang (09 Jun 2025) · MQ, MoE

Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen (05 Jun 2025)
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Tiyasa Mitra, Ritika Borkar, Nidhi Bhatia, Ramon Matas, Shivam Raj, ..., Arpan Dutta, Sailaja Madduri, Dharmesh Jani, Brian Pharris, Bita Darvish Rouhani (05 Jun 2025)

Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding
Feiyu Yao, Qian Wang (30 May 2025)

Hardware-Efficient Attention for Fast Decoding
Ted Zadouri, Hubert Strauss, Tri Dao (27 May 2025)

Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, Junlin Li, Yixin Ji, Zhiyong Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zehao Wang, Baoxing Huai, Hao Fei (28 Apr 2025) · LLMAG

L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
Qingyuan Liu, Liyan Chen, Yanning Yang, Haoyu Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen (24 Apr 2025)

HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim (18 Apr 2025)

Cognitive Memory in Large Language Models
Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu (03 Apr 2025) · LLMAG, KELM
Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu (26 Mar 2025)

LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
G. Wang, Shubhangi Upasani, Chen Henry Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker (11 Mar 2025)

Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, Xuelong Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko (09 Mar 2025) · LRM

HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Julius Berner (18 Feb 2025)

Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao (11 Jan 2025)
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, W. Lee, Minjae Lee, ..., Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi (28 Dec 2024)

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen (18 Dec 2024) · AI4CE

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Eric Liang (25 Nov 2024)

NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Xuanlin Jiang, Yang Zhou, Shiyi Cao, Eric Liang, Minlan Yu (02 Nov 2024)

ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Yiyuan Ma, Wenlei Bao, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen (28 Oct 2024) · VLM

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang (08 Sep 2024)
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Shi Luohe, Hongyi Zhang, Yao Yao, Z. Li, Zhao Hai (25 Jul 2024)

A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang (22 Apr 2024)