SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

31 August 2023
Amey Agrawal
Ashish Panwar
Jayashree Mohan
Nipun Kwatra
Bhargav S. Gulavani
Ramachandran Ramjee
AI4TS, LRM
ArXiv (abs) · PDF · HTML · HuggingFace (1 upvote) · GitHub (56,282★)

Papers citing "SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills"

50 / 85 papers shown
MindGPT-4ov: An Enhanced MLLM via a Multi-Stage Post-Training Paradigm
Wei Chen
Chaoqun Du
Feng Gu
Wei He
Qizhen Li
...
Pengfei Yu
Y. Zheng
Chunpeng Zhou
Pan Zhou
Xuhan Zhu
MLLM, OffRL, VLM
733
7
0
02 Dec 2025
KV Pareto: Systems-Level Optimization of KV Cache and Model Compression for Long Context Inference
Sai Gokhale
Devleena Das
Rajeev Patwari
Ashish Sirasao
Elliott Delaye
MQ
452
0
0
01 Dec 2025
DSD: A Distributed Speculative Decoding Solution for Edge-Cloud Agile Large Model Serving
Fengze Yu
Leshu Li
Brad McDanel
Sai Qian Zhang
329
2
0
26 Nov 2025
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators
Jonathan Li
Nasim Farahini
Evgenii Iuliugin
Magnus Vesterlund
Christian Haggstrom
...
Mingran Wang
Qinghua Li
Bo Li
Urmish Thakker
R. Prabhakar
VLM
413
1
0
05 Nov 2025
From Principles to Practice: A Systematic Study of LLM Serving on Multi-core NPUs
Tianhao Zhu
Dahu Feng
Erhu Feng
Yubin Xia
174
1
0
07 Oct 2025
VeriLLM: A Lightweight Framework for Publicly Verifiable Decentralized Inference
Ke Wang
Zishuo Zhao
Xinyuan Song
Bill Shi
Libin Xia
Chris Tong
Lynn Ai
Felix Qu
Eric Yang
348
0
0
29 Sep 2025
LongCat-Flash Technical Report
M-A-P Team
Bayan
Bei Li
Bingye Lei
Bo Wang
...
Rongxiang Weng
Ruichen Shao
Rumei Li
Shizhe Wu
Shuai Liang
MLLM, MoE, VLM
527
32
0
01 Sep 2025
Adaptively Robust LLM Inference Optimization under Prediction Uncertainty
Zixi Chen
Yinyu Ye
Zijie Zhou
167
4
0
20 Aug 2025
P/D-Device: Disaggregated Large Language Model between Cloud and Devices
Yibo Jin
Yixu Xu
Yue-ting Chen
C. Wang
Tao Wang
...
Zhe Wang
Hefei Guo
Hongjie Liu
Wei Lu
Zhengyong Zhang
273
1
0
12 Aug 2025
LLM Serving Optimization with Variable Prefill and Decode Lengths
Meixuan Wang
Yinyu Ye
Zijie Zhou
190
5
0
08 Aug 2025
Block: Balancing Load in LLM Serving with Context, Knowledge and Predictive Scheduling
Wei Da
Evangelia Kalyvianaki
241
0
0
05 Aug 2025
Forecasting LLM Inference Performance via Hardware-Agnostic Analytical Modeling
Rajeev Patwari
Ashish Sirasao
Devleena Das
241
4
0
29 Jul 2025
ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Zedong Liu
Shenggan Cheng
Guangming Tan
Yang You
Dingwen Tao
618
5
0
14 Jul 2025
Symbiosis: Multi-Adapter Inference and Fine-Tuning
Saransh Gupta
Umesh Deshpande
Travis Janssen
Swami Sundararaman
MoE
417
0
0
03 Jul 2025
Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?
Adithya Bhaskar
Alexander Wettig
Tianyu Gao
Yihe Dong
Danqi Chen
301
8
0
20 Jun 2025
Beyond the Buzz: A Pragmatic Take on Inference Disaggregation
Tiyasa Mitra
Ritika Borkar
Nidhi Bhatia
Ramon Matas
Shivam Raj
...
Arpan Dutta
Sailaja Madduri
Dharmesh Jani
Brian Pharris
Bita Darvish Rouhani
399
7
0
05 Jun 2025
Rectified Sparse Attention
Yutao Sun
Tianzhu Ye
Li Dong
Yuqing Xia
Jian Chen
Yizhao Gao
S. Cao
Jianyong Wang
Furu Wei
340
7
0
04 Jun 2025
CLaSp: In-Context Layer Skip for Self-Speculative Decoding
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Longze Chen
Renke Shan
Huiming Wang
Lu Wang
Ziqiang Liu
Run Luo
Jiawei Wang
Hamid Alinejad-Rokny
Min Yang
178
4
0
30 May 2025
MorphServe: Efficient and Workload-Aware LLM Serving via Runtime Quantized Layer Swapping and KV Cache Resizing
Zhaoyuan Su
Tingfeng Lan
Zirui Wang
Juncheng Yang
Yue Cheng
325
1
0
24 May 2025
CoDec: Prefix-Shared Decoding Kernel for LLMs
Zhibin Wang
Rui Ning
Chao Fang
Zhonghui Zhang
Xi Lin
...
Rong Gu
Kun Yang
Guihai Chen
Sheng Zhong
Chen Tian
257
6
0
23 May 2025
CASTILLO: Characterizing Response Length Distributions of Large Language Models
Daniel F. Perez-Ramirez
Dejan Kostic
Magnus Boman
218
1
0
22 May 2025
ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs
Yifan Sui
Hao Wang
Hanfei Yu
Yitao Hu
Jianxun Li
Hao Wang
279
1
0
20 May 2025
Chain-of-Model Learning for Language Model
Kaitao Song
Xiaohua Wang
Xu Tan
Huiqiang Jiang
Chengruidong Zhang
...
Xiaoqing Zheng
Tao Qin
Yuqing Yang
Dongsheng Li
Lili Qiu
LRM, AI4CE
632
1
0
17 May 2025
Patchwork: A Unified Framework for RAG Serving
Bodun Hu
Luis Pabon
Saurabh Agarwal
Aditya Akella
285
0
0
01 May 2025
Prefill-level Jailbreak: A Black-Box Risk Analysis of Large Language Models
Yakai Li
Jiekang Hu
Weiduan Sang
Luping Ma
Jing Xie
Weijuan Zhang
Aimin Yu
Shijie Zhao
Qingjia Huang
Qihang Zhou
AAML
407
2
0
28 Apr 2025
KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments
Junyoung Park
Dalton Jones
Matthew J Morse
Raghavv Goel
Mingu Lee
Chris Lott
491
21
0
21 Apr 2025
Splitwiser: Efficient LM inference with constrained resources
Asad Aali
Adney Cardoza
Melissa Capo
195
0
0
21 Apr 2025
Optimizing LLM Inference: Fluid-Guided Online Scheduling with Memory Constraints
Ruicheng Ao
Gan Luo
D. Simchi-Levi
Xinshang Wang
349
12
0
15 Apr 2025
Throughput-Optimal Scheduling Algorithms for LLM Inference and AI Agents
Yueying Li
Jim Dai
Tianyi Peng
716
10
0
10 Apr 2025
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Yanbiao Liang
Huihong Shi
Haikuo Shao
Zhongfeng Wang
324
6
0
07 Apr 2025
FlowKV: A Disaggregated Inference Framework with Low-Latency KV Cache Transfer and Load-Aware Scheduling
Weiqing Li
Guochao Jiang
Xiangyong Ding
Zhangcheng Tao
Chuzhan Hao
Chenfeng Xu
Yuewei Zhang
Hao Wang
320
7
0
03 Apr 2025
Niyama: Breaking the Silos of LLM Inference Serving
Kanishk Goel
Jayashree Mohan
Nipun Kwatra
Ravi Anupindi
Ramachandran Ramjee
416
4
0
28 Mar 2025
Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su
Wei Zhao
Xuelong Li
Muralidhar Andoorveedu
Chenhao Jiang
Zhanda Zhu
Kevin Song
Christina Giannoula
Gennady Pekhimenko
LRM
413
11
0
09 Mar 2025
PAPI: Exploiting Dynamic Parallelism in Large Language Model Decoding with a Processing-In-Memory-Enabled Computing System
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2025
Yintao He
Haiyu Mao
Christina Giannoula
Mohammad Sadrosadati
Juan Gómez Luna
Huawei Li
Xiaowei Li
Ying Wang
O. Mutlu
499
34
0
21 Feb 2025
Autellix: An Efficient Serving Engine for LLM Agents as General Programs
Michael Luo
Xiaoxiang Shi
Colin Cai
Tianjun Zhang
Justin Wong
...
Chi Wang
Yanping Huang
Zhifeng Chen
Alfons Kemper
Ion Stoica
399
27
0
20 Feb 2025
Neural Attention Search
Difan Deng
Marius Lindauer
589
1
0
18 Feb 2025
TOPLOC: A Locality Sensitive Hashing Scheme for Trustless Verifiable Inference
Jack Min Ong
Matthew Di Ferrante
Aaron Pazdera
Ryan Garner
Sami Jaghouar
Manveer Basra
Max Ryabinin
Johannes Hagemann
LRM
404
13
0
27 Jan 2025
TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication
Zongwu Wang
Fangxin Liu
Mingshuai Li
Li Jiang
LRM
365
2
0
29 Dec 2024
Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu
Jinyu Chen
Peirong Zheng
Xiaoquan Yi
Tianyi Tian
...
Quan Wan
Yining Qi
Yunfeng Fan
Qinliang Su
Xuemin Shen
AI4CE
564
7
0
18 Dec 2024
Accelerating Retrieval-Augmented Generation
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Derrick Quinn
Mohammad Nouri
Neel Patel
John Salihu
Alireza Salemi
Sukhan Lee
Hamed Zamani
Mohammad Alian
RALM, 3DV
434
36
0
14 Dec 2024
A dynamic parallel method for performance optimization on hybrid CPUs
Luo Yu
Liu Yucheng
Shen Haihao
251
0
0
29 Nov 2024
FastSwitch: Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving
Ao Shen
Zhiyao Li
Mingyu Gao
296
6
0
27 Nov 2024
Ensuring Fair LLM Serving Amid Diverse Applications
Redwan Ibne Seraj Khan
Kunal Jain
Haiying Shen
Ankur Mallick
Anjaly Parayil
...
Yue Cheng
A. R. Butt
Victor Rühle
Chetan Bansal
Saravan Rajmohan
269
1
0
24 Nov 2024
Software Performance Engineering for Foundation Model-Powered Software
Haoxiang Zhang
Shi Chang
Arthur Leung
Kishanthan Thangarajah
Boyuan Chen
Hanan Lutfiyya
Ahmed E. Hassan
622
2
0
14 Nov 2024
AI Metropolis: Scaling Large Language Model-based Multi-Agent Simulation with Out-of-order Execution
Zhiqiang Xie
Hao Kang
Ying Sheng
Tushar Krishna
Kayvon Fatahalian
Christos Kozyrakis
LRM, AI4CE, LLMAG, LM&Ro
257
11
0
05 Nov 2024
BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
The Web Conference (WWW), 2024
Peizhuang Cong
Qizhi Chen
Haochen Zhao
Tong Yang
KELM
215
3
0
24 Oct 2024
POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024
Aditya K Kamath
Ramya Prabhu
Jayashree Mohan
Simon Peter
Ramachandran Ramjee
Ashish Panwar
282
61
0
23 Oct 2024
A Survey: Collaborative Hardware and Software Design in the Era of Large Language Models
IEEE Circuits and Systems Magazine (IEEE CSM), 2024
Cong Guo
Feng Cheng
Zhixu Du
James Kiessling
Jonathan Ku
...
Qilin Zheng
Guanglei Zhou
Hai (Helen) Li
Yiran Chen
268
25
0
08 Oct 2024
Geometric Collaborative Filtering with Convergence
International Conference on Artificial Intelligence and Statistics (AISTATS), 2024
Hisham Husain
Julien Monteil
FedML
535
23
0
04 Oct 2024
Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices
Yuxiang Huang
Binhang Yuan
Xu Han
Chaojun Xiao
Zhiyuan Liu
RALM
631
12
0
02 Oct 2024