Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization

14 February 2025
Bowen Pang, Kai Li, Ruifeng She, Feifan Wang
Tags: OffRL

Papers citing "Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization"

26 / 26 papers shown
Automatic Operator-level Parallelism Planning for Distributed Deep Learning -- A Mixed-Integer Programming Approach
Ruifeng She, Bowen Pang, Kai Li, Zehua Liu, Tao Zhong
186 · 0 · 0 · 12 Mar 2025

InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management
Wonbeom Lee, Jungi Lee, Junghwan Seo, Jaewoong Sim
Tags: RALM
151 · 164 · 0 · 28 Jun 2024

Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving
Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu
574 · 115 · 0 · 24 Jun 2024

Llumnix: Dynamic Scheduling for Large Language Model Serving
Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, Jialin Li
175 · 122 · 0 · 05 Jun 2024

A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu Wang
316 · 163 · 0 · 22 Apr 2024

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee
364 · 331 · 0 · 04 Mar 2024

SparseLLM: Towards Global Pruning for Pre-trained Language Models
Guangji Bai, Yijiang Li, Chen Ling, Kibaek Kim, Bo Pan
409 · 25 · 0 · 28 Feb 2024

Inference without Interference: Disaggregate LLM Inference for Mixed Downstream Workloads
Cunchen Hu, Heyang Huang, Liangliang Xu, Xusheng Chen, Jiang Xu, ..., Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan
Tags: DRL
224 · 133 · 0 · 20 Jan 2024

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, Hao Zhang
241 · 389 · 0 · 18 Jan 2024

Fairness in Serving Large Language Models (USENIX OSDI 2024)
Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E. Gonzalez, Ion Stoica
280 · 76 · 0 · 31 Dec 2023

Splitwise: Efficient generative LLM inference using phase splitting (ISCA 2024)
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini
219 · 416 · 0 · 30 Nov 2023

Ring Attention with Blockwise Transformers for Near-Infinite Context (ICLR 2024)
Hao Liu, Matei A. Zaharia, Pieter Abbeel
515 · 372 · 0 · 03 Oct 2023

Efficient Memory Management for Large Language Model Serving with PagedAttention (SOSP 2023)
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica
Tags: VLM
1.1K · 3,913 · 0 · 12 Sep 2023

GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints (EMNLP 2023)
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, Sumit Sanghai
338 · 1,055 · 0 · 22 May 2023

Fast Distributed Inference Serving for Large Language Models
Bingyang Wu, Yinmin Zhong, Zili Zhang, Gang Huang, Xuanzhe Liu, Xin Jin
195 · 137 · 0 · 10 May 2023

Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, G. Irving, Jean-Baptiste Lespiau, Laurent Sifre, J. Jumper
Tags: BDL, LRM
238 · 634 · 0 · 02 Feb 2023

Fast Inference from Transformers via Speculative Decoding (ICML 2023)
Yaniv Leviathan, Matan Kalman, Yossi Matias
Tags: LRM
466 · 1,086 · 0 · 30 Nov 2022

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness (NeurIPS 2022)
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Tags: VLM
763 · 3,186 · 0 · 27 May 2022

Reducing Activation Recomputation in Large Transformer Models (MLSys 2022)
V. Korthikanti, Jared Casper, Sangkug Lym, Lawrence C. McAfee, M. Andersch, Mohammad Shoeybi, Bryan Catanzaro
Tags: AI4CE
257 · 369 · 0 · 10 May 2022

Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
Tags: ReLM, OffRL, LRM
996 · 6,547 · 0 · 27 Oct 2021

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM (SC 2021)
Deepak Narayanan, Mohammad Shoeybi, Jared Casper, P. LeGresley, M. Patwary, ..., Prethvi Kashinkunti, J. Bernauer, Bryan Catanzaro, Amar Phanishayee, Matei A. Zaharia
Tags: MoE
623 · 936 · 0 · 09 Apr 2021

Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
512 · 620 · 0 · 06 Nov 2019

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
Tags: MoE
829 · 2,329 · 0 · 17 Sep 2019

Generating Long Sequences with Sparse Transformers
R. Child, Scott Gray, Alec Radford, Ilya Sutskever
279 · 2,212 · 0 · 23 Apr 2019

Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference
Benoit Jacob, S. Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew G. Howard, Hartwig Adam, Dmitry Kalenichenko
Tags: MQ
585 · 3,651 · 0 · 15 Dec 2017

Distilling the Knowledge in a Neural Network
Geoffrey E. Hinton, Oriol Vinyals, J. Dean
Tags: FedML
777 · 22,113 · 0 · 09 Mar 2015