Splitwise: Efficient generative LLM inference using phase splitting
arXiv:2311.18677, 30 November 2023
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
Papers citing "Splitwise: Efficient generative LLM inference using phase splitting" (showing 50 of 110 papers)
MAS-Attention: Memory-Aware Stream Processing for Attention Acceleration on Resource-Constrained Edge Devices. Mohammadali Shakerdargah, Shan Lu, Chao Gao, Di Niu. 20 Nov 2024.
DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving. Yuhan Liu, ..., Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse. 05 Nov 2024.
NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu. 02 Nov 2024.
BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching [KELM]. Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang. 24 Oct 2024.
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar. 23 Oct 2024.
ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference [MoE]. Xin He, Shunkang Zhang, Yuxin Wang, Haiyan Yin, Zihao Zeng, Shaohuai Shi, Zhenheng Tang, Xiaowen Chu, Ivor Tsang, Ong Yew Soon. 23 Oct 2024.
Is the GPU Half-Empty or Half-Full? Practical Scheduling Techniques for LLMs. Ferdi Kossmann, Bruce Fontaine, Daya Khudia, Michael Cafarella, Samuel Madden. 23 Oct 2024.
EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models [RALM, LLMAG]. Junhao Hu, Wenrui Huang, H. Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie. 20 Oct 2024.
Revisiting SLO and Goodput Metrics in LLM Serving. Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Rong Gu, Nguyen Cam-Tu, Chen Tian, Sheng Zhong. 18 Oct 2024.
EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [MoE]. Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, K. Zhang, Xunliang Cai. 16 Oct 2024.
Arrhythmia Classification Using Graph Neural Networks Based on Correlation Matrix. Seungwoo Han. 14 Oct 2024.
Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM. Haiyue Ma, Jian Liu, Ronny Krashinsky. 10 Oct 2024.
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models. Cong Guo, Feng Cheng, Zhixu Du, James Kiessling, Jonathan Ku, ..., Qilin Zheng, Guanglei Zhou, Hai Li, Yiran Chen. 08 Oct 2024.
LLMCO2: Advancing Accurate Carbon Footprint Prediction for LLM Inferences. Zhenxiao Fu, Fan Chen, Shan Zhou, Haitong Li, Lei Jiang. 03 Oct 2024.
ConServe: Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving. Yifan Qiao, Shu Anzai, Shan Yu, Haoran Ma, Yang Wang, Miryung Kim, Harry Xu. 02 Oct 2024.
Input-Dependent Power Usage in GPUs. Theo Gregersen, Pratyush Patel, Esha Choukse. 26 Sep 2024.
Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations [RALM, LRM]. A. Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, R. Ramjee, Alexey Tumanov, Esha Choukse. 25 Sep 2024.
CSPS: A Communication-Efficient Sequence-Parallelism based Serving System for Transformer based Models with Long Prompts [VLM]. Zeyu Zhang, Haiying Shen. 23 Sep 2024.
InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference. Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang. 08 Sep 2024.
Achieving Peak Performance for Large Language Models: A Systematic Review. Z. R. K. Rostam, Sándor Szénási, Gábor Kertész. 07 Sep 2024.
Duplex: A Device for Large Language Models with Mixture of Experts, Grouped Query Attention, and Continuous Batching [MoE]. Sungmin Yun, Kwanhee Kyung, Juhwan Cho, Jaewan Choi, Jongmin Kim, Byeongho Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn. 02 Sep 2024.
Efficient LLM Scheduling by Learning to Rank. Yichao Fu, Siqi Zhu, Runlong Su, Aurick Qiao, Ion Stoica, Hao Zhang. 28 Aug 2024.
Intelligent Router for LLM Workloads: Improving Performance Through Workload-Aware Scheduling. Kunal Jain, Anjaly Parayil, Ankur Mallick, Esha Choukse, Xiaoting Qin, ..., Chetan Bansal, Victor Rühle, Anoop Kulkarni, Steve Kofsky, Saravan Rajmohan. 24 Aug 2024.
P/D-Serve: Serving Disaggregated Large Language Model at Scale [MoE]. Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, ..., Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang. 15 Aug 2024.
Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference. R. Prabhakar, Hengrui Zhang, D. Wentzlaff. 14 Aug 2024.
LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference. Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, ..., Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang. 12 Aug 2024.
Post-Training Sparse Attention with Double Sparsity. Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng. 11 Aug 2024.
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale. Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park. 10 Aug 2024.
DynamoLLM: Designing LLM Inference Clusters for Performance and Energy Efficiency. Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, Esha Choukse. 01 Aug 2024.
Improving GPU Multi-Tenancy Through Dynamic Multi-Instance GPU Reconfiguration. Tianyu Wang, Sheng R. Li, Bingyao Li, Yuezhen Dai, Ao Li, Geng Yuan, Yufei Ding, Youtao Zhang, Xulong Tang. 18 Jul 2024.
LLM Inference Serving: Survey of Recent Advances and Opportunities. Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari. 17 Jul 2024.
Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems. Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, R. Ramjee, Alexey Tumanov. 09 Jul 2024.
Offline Energy-Optimal LLM Serving: Workload-Based Energy Models for LLM Inference on Heterogeneous Systems. Grant Wilkins, Srinivasan Keshav, Richard Mortier. 04 Jul 2024.
Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu. 24 Jun 2024.
Parrot: Efficient Serving of LLM-based Applications with Semantic Variable. Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu. 30 May 2024.
A Declarative System for Optimizing AI Workloads. Chunwei Liu, Matthew Russo, Michael Cafarella, Lei Cao, Peter Baille Chen, Zui Chen, Michael Franklin, Tim Kraska, Samuel Madden, Gerardo Vitagliano. 23 May 2024.
EdgeShard: Efficient LLM Inference via Collaborative Edge Computing. Mingjin Zhang, Jiannong Cao, Xiaoming Shen, Zeyang Cui. 23 May 2024.
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [MQ]. Peiyu Liu, Zeming Gao, Wayne Xin Zhao, Yipeng Ma, Tao Wang, Ji-Rong Wen. 21 May 2024.
Vidur: A Large-Scale Simulation Framework For LLM Inference [VLM]. Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, R. Ramjee, Alexey Tumanov. 08 May 2024.
KV-Runahead: Scalable Causal LLM Inference by Parallel Key-Value Cache Generation. Minsik Cho, Mohammad Rastegari, Devang Naik. 08 May 2024.
Critical Infrastructure Protection: Generative AI, Challenges, and Opportunities. Yagmur Yigit, M. Ferrag, Iqbal H. Sarker, Leandros A. Maglaras, Christos Chrysoulas, Naghmeh Moradpoor, Helge Janicke. 08 May 2024.
Preble: Efficient Distributed Prompt Scheduling for LLM Serving. Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang. 08 May 2024.
vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention [VLM]. Ramya Prabhu, Ajay Nayak, Jayashree Mohan, R. Ramjee, Ashish Panwar. 07 May 2024.
Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity. Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica. 22 Apr 2024.
A Survey on Efficient Inference for Large Language Models. Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu-Xiang Wang. 22 Apr 2024.
LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism [RALM]. Bingya Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin. 15 Apr 2024.
Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts. Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang. 07 Apr 2024.
Towards Greener LLMs: Bringing Energy-Efficiency to the Forefront of LLM Inference. Jovan Stojkovic, Esha Choukse, Chaojie Zhang, Inigo Goiri, Josep Torrellas. 29 Mar 2024.
Cost-Efficient Large Language Model Serving for Multi-turn Conversations with CachedAttention. Bin Gao, Zhuomin He, Puru Sharma, Qingxuan Kang, Djordje Jevdjic, Junbo Deng, Xingkun Yang, Zhou Yu, Pengfei Zuo. 23 Mar 2024.
Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, R. Ramjee. 04 Mar 2024.