Vidur: A Large-Scale Simulation Framework For LLM Inference
arXiv:2405.05465 · 8 May 2024
Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav S. Gulavani, R. Ramjee, Alexey Tumanov
Tags: VLM
Papers citing "Vidur: A Large-Scale Simulation Framework For LLM Inference" (19 of 19 papers shown)
1. SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
   Hang Wu, Jianian Zhu, Y. Li, Haojie Wang, Biao Hou, Jidong Zhai
   12 May 2025
2. Taming the Titans: A Survey of Efficient LLM Inference Serving
   Ranran Zhen, J. Li, Yixin Ji, Z. Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Z. Wang, Baoxing Huai, M. Zhang
   Tags: LLMAG
   28 Apr 2025
3. Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
   Ruicheng Ao, Gan Luo, D. Simchi-Levi, Xinshang Wang
   15 Apr 2025
4. Understanding and Optimizing Multi-Stage AI Inference Pipelines
   A. Bambhaniya, Hanjiang Wu, Suvinay Subramanian, S. Srinivasan, Souvik Kundu, Amir Yazdanbakhsh, Midhilesh Elavazhagan, Madhu Kumar, Tushar Krishna
   14 Apr 2025
5. Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
   Yueying Li, Jim Dai, Tianyi Peng
   10 Apr 2025
6. Niyama : Breaking the Silos of LLM Inference Serving
   Kanishk Goel, Jayashree Mohan, Nipun Kwatra, Ravi Anupindi, R. Ramjee
   28 Mar 2025
7. AccelGen: Heterogeneous SLO-Guaranteed High-Throughput LLM Inference Serving for Diverse Applications
   Haiying Shen, Tanmoy Sen
   17 Mar 2025
8. KVDirect: Distributed Disaggregated LLM Inference
   Shiyang Chen, Rain Jiang, Dezhi Yu, Jinlai Xu, Mengyuan Chao, Fanlong Meng, Chenyu Jiang, Wei Xu, Hang Liu
   28 Jan 2025
9. iServe: An Intent-based Serving System for LLMs
   Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
   Tags: VLM
   08 Jan 2025
10. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
    Aditya K Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, R. Ramjee, Ashish Panwar
    23 Oct 2024
11. Revisiting SLO and Goodput Metrics in LLM Serving
    Zhibin Wang, Shipeng Li, Yuhang Zhou, Xue Li, Rong Gu, Nguyen Cam-Tu, Chen Tian, Sheng Zhong
    18 Oct 2024
12. EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference
    Yulei Qian, Fengcun Li, Xiangyang Ji, Xiaoyu Zhao, Jianchao Tan, K. Zhang, Xunliang Cai
    Tags: MoE
    16 Oct 2024
13. Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM
    Haiyue Ma, Jian Liu, Ronny Krashinsky
    10 Oct 2024
14. Medha: Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations
    A. Agrawal, Haoran Qiu, Junda Chen, Íñigo Goiri, Chaojie Zhang, Rayyan Shahid, R. Ramjee, Alexey Tumanov, Esha Choukse
    Tags: RALM, LRM
    25 Sep 2024
15. LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
    Jaehong Cho, Minsu Kim, Hyunmin Choi, Guseul Heo, Jongse Park
    10 Aug 2024
16. Etalon: Holistic Performance Evaluation Framework for LLM Inference Systems
    Amey Agrawal, Anmol Agarwal, Nitin Kedia, Jayashree Mohan, Souvik Kundu, Nipun Kwatra, R. Ramjee, Alexey Tumanov
    09 Jul 2024
17. vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention
    Ramya Prabhu, Ajay Nayak, Jayashree Mohan, R. Ramjee, Ashish Panwar
    Tags: VLM
    07 May 2024
18. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve
    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, R. Ramjee
    04 Mar 2024
19. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
    M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
    Tags: MoE
    17 Sep 2019