FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
18 March 2024
Jiaao He, Jidong Zhai
ArXiv (abs) · PDF · HTML · GitHub (13250★)
Papers citing "FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines"

27 / 27 papers shown
Argus: Quality-Aware High-Throughput Text-to-Image Inference Serving System
Shubham Agarwal, Subrata Mitra, Saud Iqbal
DiffM, VLM
10 Nov 2025
FlexLink: Boosting your NVLink Bandwidth by 27% without accuracy concern
Ao Shen, Rui Zhang, Junping Zhao
30 Aug 2025
FastTTS: Accelerating Test-Time Scaling for Edge LLM Reasoning
Hao Mark Chen, Zhiwen Mo, Guanxi Lu, Shuang Liang, Lingxiao Ma, Wayne Luk, Hongxiang Fan
LRM
29 Aug 2025
TokenLake: A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving
Bingyang Wu, Zili Zhang, Yinmin Zhong, Guanzhe Huang, Yibo Zhu, Xuanzhe Liu, Xin Jin
24 Aug 2025
HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs
Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao
MQ
26 Jul 2025
MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, Jianzong Wang
MQ, MoE
09 Jun 2025
Kinetics: Rethinking Test-Time Scaling Laws
Ranajoy Sadhukhan, Zhuoming Chen, Haizhong Zheng, Yang Zhou, Emma Strubell, Beidi Chen
05 Jun 2025
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Tiyasa Mitra, Ritika Borkar, Nidhi Bhatia, Ramon Matas, Shivam Raj, ..., Arpan Dutta, Sailaja Madduri, Dharmesh Jani, Brian Pharris, Bita Darvish Rouhani
05 Jun 2025
Learn from the Past: Fast Sparse Indexing for Large Language Model Decoding
Feiyu Yao, Qian Wang
30 May 2025
Hardware-Efficient Attention for Fast Decoding
Ted Zadouri, Hubert Strauss, Tri Dao
27 May 2025
Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, Junlin Li, Yixin Ji, Zhiyong Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zehao Wang, Baoxing Huai, Hao Fei
LLMAG
28 Apr 2025
L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference
Qingyuan Liu, Liyan Chen, Yanning Yang, Haoyu Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen
24 Apr 2025
HPU: High-Bandwidth Processing Unit for Scalable, Cost-effective LLM Inference via GPU Co-processing
Myunghyun Rhee, Joonseop Sim, Taeyoung Ahn, Seungyong Lee, Daegun Yoon, Euiseok Kim, Kyoung Park, Youngpyo Joo, Hosik Kim
18 Apr 2025
Cognitive Memory in Large Language Models
Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
LLMAG, KELM
03 Apr 2025
Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation
Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu
26 Mar 2025
LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference
G. Wang, Shubhangi Upasani, Chen Henry Wu, Darshan Gandhi, Jonathan Li, Changran Hu, Bo Li, Urmish Thakker
11 Mar 2025
Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, Xuelong Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
LRM
09 Mar 2025
HeadInfer: Memory-Efficient LLM Inference by Head-wise Offloading
Cheng Luo, Zefan Cai, Hanshi Sun, Jinqi Xiao, Bo Yuan, Wen Xiao, Junjie Hu, Jiawei Zhao, Beidi Chen, Julius Berner
18 Feb 2025
Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao
11 Jan 2025
PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System
Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, W. Lee, Minjae Lee, ..., Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi
28 Dec 2024
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen
AI4CE
18 Dec 2024
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching
Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Eric Liang
25 Nov 2024
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference
Xuanlin Jiang, Yang Zhou, Shiyi Cao, Eric Liang, Minlan Yu
02 Nov 2024
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Hanshi Sun, Li-Wen Chang, Yiyuan Ma, Wenlei Bao, Ningxin Zheng, Xin Liu, Harry Dong, Yuejie Chi, Beidi Chen
VLM
28 Oct 2024
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
08 Sep 2024
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Shi Luohe, Hongyi Zhang, Yao Yao, Z. Li, Zhao Hai
25 Jul 2024
A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang
22 Apr 2024