arXiv:2311.04934
Prompt Cache: Modular Attention Reuse for Low-Latency Inference
7 November 2023
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, Lin Zhong
Papers citing "Prompt Cache: Modular Attention Reuse for Low-Latency Inference" (50 of 52 papers shown):
- From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs. Yaxiong Wu, Sheng Liang, Chen Zhang, Y. Wang, Y. Zhang, Huifeng Guo, Ruiming Tang, Y. Liu. [KELM] 22 Apr 2025.
- Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management. Hang Zhang, Jiuchen Shi, Yixiao Wang, Quan Chen, Yizhou Shan, Minyi Guo. 19 Apr 2025.
- HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving. Avinash Kumar, Shashank Nag, Jason Clemons, L. John, Poulami Das. 14 Apr 2025.
- HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse. Yuwei An, Yihua Cheng, Seo Jin Park, Junchen Jiang. 03 Apr 2025.
- Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference. Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral. 11 Mar 2025.
- KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse. Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang. [VLM] 21 Feb 2025.
- Auditing Prompt Caching in Language Model APIs. Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto. 11 Feb 2025.
- AdaServe: SLO-Customized LLM Serving with Fine-Grained Speculative Decoding. Zikun Li, Zhuofu Chen, Remi Delacourt, Gabriele Oliaro, Zeyu Wang, ..., Zhihao Zhang, Zhuoming Chen, Sean Lai, Xupeng Miao, Zhihao Jia. 21 Jan 2025.
- HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location. Ting Sun, Penghan Wang, Fan Lai. 15 Jan 2025.
- FlexCache: Flexible Approximate Cache System for Video Diffusion. Desen Sun, Henry Tian, Tim Lu, Sihang Liu. [DiffM] 18 Dec 2024.
- Accelerating Retrieval-Augmented Generation. Derrick Quinn, Mohammad Nouri, Neel Patel, John Salihu, Alireza Salemi, Sukhan Lee, Hamed Zamani, Mohammad Alian. [RALM, 3DV] 14 Dec 2024.
- Dipper: Diversity in Prompts for Producing Large Language Model Ensembles in Reasoning tasks. Gregory Kang Ruey Lau, Wenyang Hu, Diwen Liu, Jizhuo Chen, S. Ng, Bryan Kian Hsiang Low. [LRM, AI4CE] 12 Dec 2024.
- SocialMind: LLM-based Proactive AR Social Assistive System with Human-like Perception for In-situ Live Interactions. Bufang Yang, Yunqi Guo, Lilin Xu, Zhenyu Yan, Hongkai Chen, Guoliang Xing, Xiaofan Jiang. 05 Dec 2024.
- Marconi: Prefix Caching for the Era of Hybrid LLMs. Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Ravi Netravali, Yida Wang. 28 Nov 2024.
- PyGen: A Collaborative Human-AI Approach to Python Package Creation. Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehnaz Khaled, Md. Shohrab Hossain. 13 Nov 2024.
- DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving. Yuhan Liu, Esha Choukse, Shan Lu, Junchen Jiang, Madan Musuvathi, ..., Yihua Cheng. 05 Nov 2024.
- PMoL: Parameter Efficient MoE for Preference Mixing of LLM Alignment. Dongxu Liu, Bing Xu, Yinzhuo Chen, Bufan Xu, Wenpeng Lu, Muyun Yang, T. Zhao. [MoE] 02 Nov 2024.
- Privacy Risks of Speculative Decoding in Large Language Models. Jiankun Wei, Abdulrahman Abdulrazzag, Tianchen Zhang, Adel Muursepp, Gururaj Saileshwar. 01 Nov 2024.
- EPIC: Efficient Position-Independent Context Caching for Serving Large Language Models. Junhao Hu, Wenrui Huang, H. Wang, Weidong Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie. [RALM, LLMAG] 20 Oct 2024.
- Fast State Restoration in LLM Serving with HCache. Shiwei Gao, Youmin Chen, Jiwu Shu. 07 Oct 2024.
- Geometric Collaborative Filtering with Convergence. Hisham Husain, Julien Monteil. [FedML] 04 Oct 2024.
- The Early Bird Catches the Leak: Unveiling Timing Side Channels in LLM Serving Systems. Linke Song, Zixuan Pang, Wenhao Wang, Zihao Wang, XiaoFeng Wang, Hongbo Chen, Wei Song, Yier Jin, Dan Meng, Rui Hou. 30 Sep 2024.
- Confidential Prompting: Protecting User Prompts from Cloud LLM Providers. In Gim, Caihua Li, Lin Zhong. 27 Sep 2024.
- Do Large Language Models Need a Content Delivery Network? Yihua Cheng, Kuntai Du, Jiayi Yao, Junchen Jiang. [KELM] 16 Sep 2024.
- P/D-Serve: Serving Disaggregated Large Language Model at Scale. Yibo Jin, Tao Wang, Huimin Lin, Mingyang Song, Peiyang Li, ..., Haoliang Cheng, Xiaojing Li, Jiandong Ding, Hefei Guo, Zhengyong Zhang. [MoE] 15 Aug 2024.
- MLLM Is a Strong Reranker: Advancing Multimodal Retrieval-augmented Generation via Knowledge-enhanced Reranking and Noise-injected Training. Rivik Setty, Chengjin Xu, Vinay Setty, Jian Guo. 31 Jul 2024.
- LLM Inference Serving: Survey of Recent Advances and Opportunities. Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari. 17 Jul 2024.
- Teola: Towards End-to-End Optimization of LLM-based Applications. Xin Tan, Yimin Jiang, Yitao Yang, Hong-Yu Xu. 29 Jun 2024.
- MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool. Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, ..., Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan. [LLMAG] 25 Jun 2024.
- Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. Ruoyu Qin, Zheming Li, Weiran He, Mingxing Zhang, Yongwei Wu, Weimin Zheng, Xinran Xu. 24 Jun 2024.
- CacheBlend: Fast Large Language Model Serving for RAG with Cached Knowledge Fusion. Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, Junchen Jiang. 26 May 2024.
- Preble: Efficient Distributed Prompt Scheduling for LLM Serving. Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, Yiying Zhang. 08 May 2024.
- RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation. Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin. 18 Apr 2024.
- Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning. Xiao Wang, Tianze Chen, Xianjun Yang, Qi Zhang, Xun Zhao, Dahua Lin. [ELM] 16 Apr 2024.
- Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation. Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi. [LRM, RALM] 10 Apr 2024.
- Towards Pareto Optimal Throughput in Small Language Model Serving. Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral. 04 Apr 2024.
- DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference. Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin. 30 Mar 2024.
- Hierarchical Skip Decoding for Efficient Autoregressive Text Generation. Yunqi Zhu, Xuebing Yang, Yuanyuan Wu, Wensheng Zhang. 22 Mar 2024.
- Toward Sustainable GenAI using Generation Directives for Carbon-Friendly Large Language Model Inference. Baolin Li, Yankai Jiang, V. Gadepally, Devesh Tiwari. 19 Mar 2024.
- Rethinking Software Engineering in the Foundation Model Era: A Curated Catalogue of Challenges in the Development of Trustworthy FMware. Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur, Keheliya Gallaba, F. Côgo, ..., Kishanthan Thangarajah, G. Oliva, Jiahuei Lin, Wali Mohammad Abdullah, Zhen Ming Jiang. 25 Feb 2024.
- RelayAttention for Efficient Large Language Model Serving with Long System Prompts. Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau. 22 Feb 2024.
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems. Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia. 23 Dec 2023.
- SGLang: Efficient Execution of Structured Language Model Programs. Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, ..., Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, Ying Sheng. [LRM] 12 Dec 2023.
- Stateful Large Language Model Serving with Pensieve. Lingfan Yu, Jinyang Li. [RALM, KELM, LLMAG] 09 Dec 2023.
- TypeFly: Flying Drones with Large Language Model. Guojun Chen, Xiaojing Yu, Lin Zhong. 08 Dec 2023.
- Explore, Select, Derive, and Recall: Augmenting LLM with Human-like Memory for Mobile Task Automation. Sunjae Lee, Junyoung Choi, Jungjae Lee, Munim Hasan Wasi, Hojun Choi, Steven Y. Ko, Sangeun Oh, Insik Shin. [RALM] 04 Dec 2023.
- The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning. Bill Yuchen Lin, Abhilasha Ravichander, Ximing Lu, Nouha Dziri, Melanie Sclar, Khyathi Raghavi Chandu, Chandra Bhagavatula, Yejin Choi. 04 Dec 2023.
- CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving. Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, ..., Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang. 11 Oct 2023.
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang. 13 Mar 2023.
- Latency Adjustable Transformer Encoder for Language Understanding. Sajjad Kachuee, M. Sharifkhani. 10 Jan 2022.