Cited By
Splitwise: Efficient generative LLM inference using phase splitting (arXiv 2311.18677)
30 November 2023
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
Papers citing "Splitwise: Efficient generative LLM inference using phase splitting" (50 of 110 papers shown)
PipeSpec: Breaking Stage Dependencies in Hierarchical LLM Decoding (02 May 2025). Bradley McDanel, S. Zhang, Y. Hu, Zining Liu. [MoE]
Ascendra: Dynamic Request Prioritization for Efficient LLM Serving (29 Apr 2025). Azam Ikram, Xiang Li, Sameh Elnikety, S. Bagchi.
Taming the Titans: A Survey of Efficient LLM Inference Serving (28 Apr 2025). Ranran Zhen, J. Li, Yixin Ji, Z. Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Z. Wang, Baoxing Huai, M. Zhang. [LLMAG]
GenTorrent: Scaling Large Language Model Serving with An Overlay Network (27 Apr 2025). Fei Fang, Yifan Hua, Shengze Wang, Ruilin Zhou, Y. Liu, Chen Qian, X. Zhang.
Tempo: Application-aware LLM Serving with Mixed SLO Requirements (24 Apr 2025). Wei Zhang, Zhiyu Wu, Yi Mu, Banruo Liu, Myungjin Lee, Fan Lai.
StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation (22 Apr 2025). Yinmin Zhong, Zili Zhang, Xiaoniu Song, Hanpeng Hu, Chao Jin, ..., Changyi Wan, Hongyu Zhou, Yimin Jiang, Yibo Zhu, Daxin Jiang. [OffRL, AI4TS]
Splitwiser: Efficient LM inference with constrained resources (21 Apr 2025). Asad Aali, Adney Cardoza, Melissa Capo.
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints (15 Apr 2025). Ruicheng Ao, Gan Luo, D. Simchi-Levi, Xinshang Wang.
Understanding and Optimizing Multi-Stage AI Inference Pipelines (14 Apr 2025). A. Bambhaniya, Hanjiang Wu, Suvinay Subramanian, S. Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna.
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving (14 Apr 2025). Avinash Kumar, Shashank Nag, Jason Clemons, L. John, Poulami Das.
MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints (12 Apr 2025). Yichao Yuan, Lin Ma, Nishil Talati. [MoE]
MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications (11 Apr 2025). Aashaka Shah, Abhinav Jangda, B. Li, Caio Rocha, Changho Hwang, ..., Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang. [GNN]
Apt-Serve: Adaptive Request Scheduling on Hybrid Cache for Scalable LLM Inference Serving (10 Apr 2025). Shihong Gao, X. Zhang, Yanyan Shen, Lei Chen.
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents (10 Apr 2025). Yueying Li, Jim Dai, Tianyi Peng.
Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching (08 Apr 2025). Yanhao Dong, Yubo Miao, Weinan Li, Xiao Zheng, Chao Wang, Feng Lyu.
SLOs-Serve: Optimized Serving of Multi-SLO LLMs (05 Apr 2025). Siyuan Chen, Zhipeng Jia, S. Khan, Arvind Krishnamurthy, Phillip B. Gibbons.
FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling (03 Apr 2025). Weiqing Li, Guochao Jiang, Xiangyong Ding, Zhangcheng Tao, Chuzhan Hao, Chenfeng Xu, Yuewei Zhang, Hao Wang.
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond (27 Mar 2025). Xiaoye Qu, Yafu Li, Zhaochen Su, Weigao Sun, Jianhao Yan, ..., Chaochao Lu, Yue Zhang, Xian-Sheng Hua, Bowen Zhou, Yu Cheng. [ReLM, OffRL, LRM]
Injecting Adrenaline into LLM Serving: Boosting Resource Utilization and Throughput via Attention Disaggregation (26 Mar 2025). Yunkai Liang, Zhangyu Chen, Pengfei Zuo, Zhi Zhou, Xu Chen, Zhou Yu.
SplitFrozen: Split Learning with Device-side Model Frozen for Fine-Tuning LLM on Heterogeneous Resource-Constrained Devices (23 Mar 2025). Jian Ma, Xinchen Lyu, Jun Jiang, Qimei Cui, Haipeng Yao, Xiaofeng Tao.
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving (18 Mar 2025). Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, Vidushi Dadu.
AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications (17 Mar 2025). Haiying Shen, Tanmoy Sen.
MoE-Gen: High-Throughput MoE Inference on a Single GPU with Module-Based Batching (12 Mar 2025). Tairan Xu, Leyang Xue, Zhan Lu, Adrian Jackson, Luo Mai. [MoE]
Sometimes Painful but Certainly Promising: Feasibility and Trade-offs of Language Model Inference at the Edge (12 Mar 2025). Maximilian Abstreiter, Sasu Tarkoma, Roberto Morabito.
Queueing, Predictions, and LLMs: Challenges and Open Problems (10 Mar 2025). Michael Mitzenmacher, Rana Shahout. [AI4TS, LRM]
Green Prompting (09 Mar 2025). Marta Adamska, Daria Smirnova, Hamid Nasiri, Zhengxin Yu, Peter Garraghan.
Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation (09 Mar 2025). Yingfeng Luo, Tong Zheng, Yongyu Mu, B. Li, Qinghong Zhang, ..., Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu. [AI4CE]
Seesaw: High-throughput LLM Inference via Model Re-sharding (09 Mar 2025). Qidong Su, Wei Zhao, X. Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko. [LRM]
SpecServe: Efficient and SLO-Aware Large Language Model Serving with Adaptive Speculative Decoding (07 Mar 2025). Kaiyu Huang, Hao Wu, Zhubo Shi, Han Zou, Minchen Yu, Qingjiang Shi. [LRM]
Alchemist: Towards the Design of Efficient Online Continual Learning System (03 Mar 2025). Yuyang Huang, Yuhan Liu, Haryadi S. Gunawi, Beibin Li, Changho Hwang. [CLL, OnRL]
SRAG: Structured Retrieval-Augmented Generation for Multi-Entity Question Answering over Wikipedia Graph (03 Mar 2025). Teng Lin, Yizhang Zhu, Yuyu Luo, Nan Tang. [RALM, 3DV]
Progressive Sparse Attention: Algorithm and System Co-design for Efficient Attention in LLM Serving (01 Mar 2025). Qihui Zhou, Peiqi Yin, Pengfei Zuo, James Cheng. [CLL]
MEBench: Benchmarking Large Language Models for Cross-Document Multi-Entity Question Answering (26 Feb 2025). Teng Lin. [RALM]
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System (21 Feb 2025). Yintao He, Haiyu Mao, Christina Giannoula, Mohammad Sadrosadati, Juan Gómez Luna, Huawei Li, Xiaowei Li, Ying Wang, O. Mutlu.
Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization (14 Feb 2025). Bowen Pang, Kai Li, Ruifeng She, Feifan Wang. [OffRL]
fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving (07 Feb 2025). Hanfei Yu, Xingqi Cui, H. M. Zhang, H. Wang, Hao Wang. [MoE]
KVDirect: Distributed Disaggregated LLM Inference (28 Jan 2025). Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu.
AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding (21 Jan 2025). Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, ..., Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia.
HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location (15 Jan 2025). Ting Sun, Penghan Wang, Fan Lai.
iServe: An Intent-based Serving System for LLMs (08 Jan 2025). Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar. [VLM]
TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms (05 Jan 2025). Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, Ricardo Bianchini.
RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval (03 Jan 2025). Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, ..., K. Zhang, C. L. P. Chen, Fan Yang, Y. Yang, Lili Qiu.
Efficiently Serving LLM Reasoning Programs with Certaindex (31 Dec 2024). Yichao Fu, Junda Chen, Siqi Zhu, Zheyu Fu, Zhongdongming Dai, Aurick Qiao, Hao Zhang. [LRM]
LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System (28 Dec 2024). Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, W. Lee, Minjae Lee, ..., Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi.
SYMPHONY: Improving Memory Management for LLM Inference Workloads (21 Dec 2024). Saurabh Agarwal, Anyong Mao, Aditya Akella, Shivaram Venkataraman. [LLMAG]
Deploying Foundation Model Powered Agent Services: A Survey (18 Dec 2024). Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Haozhao Wang, Yunfeng Fan, Qinliang Su, Xuemin Shen. [AI4CE]
Accelerating Retrieval-Augmented Generation (14 Dec 2024). Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, Mohammad Alian. [RALM, 3DV]
ClusterKV: Manipulating LLM KV Cache in Semantic Space for Recallable Compression (04 Dec 2024). Guangda Liu, C. Li, Jieru Zhao, Chenqi Zhang, M. Guo.
BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching (25 Nov 2024). Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica.
Ensuring Fair LLM Serving Amid Diverse Applications (24 Nov 2024). Redwan Ibne Seraj Khan, Kunal Jain, Haiying Shen, Ankur Mallick, Anjaly Parayil, ..., Yue Cheng, A. R. Butt, Victor Rühle, Chetan Bansal, Saravan Rajmohan.