Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization
Bowen Pang, Kai Li, Ruifeng She, Feifan Wang
arXiv:2502.15763, 14 February 2025
Papers citing "Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization" (26 papers):
1. Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach. Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong. 12 Mar 2025.
2. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim. 28 Jun 2024.
3. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu. 24 Jun 2024.
4. Llumnix: Dynamic Scheduling for Large Language Model Serving. Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Jialin Li. 05 Jun 2024.
5. A Survey on Efficient Inference for Large Language Models. Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang. 22 Apr 2024.
6. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee. 04 Mar 2024.
7. SparseLLM: Towards Global Pruning for Pre-trained Language Models. Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Bo Pan. 28 Feb 2024.
8. Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads. Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, ..., Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan. 20 Jan 2024.
9. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang. 18 Jan 2024.
10. Fairness in Serving Large Language Models. Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica. USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2023. 31 Dec 2023.
11. Splitwise: Efficient generative LLM inference using phase splitting. Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini. International Symposium on Computer Architecture (ISCA), 2023. 30 Nov 2023.
12. Ring Attention with Blockwise Transformers for Near-Infinite Context. Hao Liu, Matei A. Zaharia, Pieter Abbeel. International Conference on Learning Representations (ICLR), 2023. 03 Oct 2023.
13. Efficient Memory Management for Large Language Model Serving with PagedAttention. Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Haotong Zhang, Ion Stoica. Symposium on Operating Systems Principles (SOSP), 2023. 12 Sep 2023.
14. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai. Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023. 22 May 2023.
15. Fast Distributed Inference Serving for Large Language Models. Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, Xin Jin. 10 May 2023.
16. Accelerating Large Language Model Decoding with Speculative Sampling. Charlie Chen, Sebastian Borgeaud, G. Irving, Jean-Baptiste Lespiau, Laurent Sifre, J. Jumper. 02 Feb 2023.
17. Fast Inference from Transformers via Speculative Decoding. Yaniv Leviathan, Matan Kalman, Yossi Matias. International Conference on Machine Learning (ICML), 2022. 30 Nov 2022.
18. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré. Neural Information Processing Systems (NeurIPS), 2022. 27 May 2022.
19. Reducing Activation Recomputation in Large Transformer Models. V. Korthikanti, Jared Casper, Sangkug Lym, Lawrence C. McAfee, M. Andersch, Mohammad Shoeybi, Bryan Catanzaro. Conference on Machine Learning and Systems (MLSys), 2022. 10 May 2022.
20. Training Verifiers to Solve Math Word Problems. K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman. 27 Oct 2021.
21. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. Deepak Narayanan, Mohammad Shoeybi, Jared Casper, P. LeGresley, M. Patwary, ..., Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia. International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2021. 09 Apr 2021.
22. Fast Transformer Decoding: One Write-Head is All You Need. Noam M. Shazeer. 06 Nov 2019.
23. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro. 17 Sep 2019.
24. Generating Long Sequences with Sparse Transformers. R. Child, Scott Gray, Alec Radford, Ilya Sutskever. 23 Apr 2019.
25. Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference. Benoit Jacob, S. Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, Dmitry Kalenichenko. 15 Dec 2017.
26. Distilling the Knowledge in a Neural Network. Geoffrey E. Hinton, Oriol Vinyals, J. Dean. 09 Mar 2015.