Cited By: Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache
arXiv:2401.02669 · 5 January 2024
Bin Lin, Chen Zhang, Tao Peng, Hanyu Zhao, Wencong Xiao, Minmin Sun, Qi Xu, Zhipeng Zhang, Lanbo Li, Xiafei Qiu, Shen Li, Zhigang Ji, Tao Xie, Yong Li, Wei Lin
Papers citing "Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache" (15 papers shown)
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu, Jianian Zhu, Y. Li, Haojie Wang, Biao Hou, Jidong Zhai
12 May 2025

Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, J. Li, Yixin Ji, Z. Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Z. Wang, Baoxing Huai, M. Zhang
28 Apr 2025 · LLMAG

Cognitive Memory in Large Language Models
Lianlei Shan, Shixian Luo, Zezhou Zhu, Yu Yuan, Yong Wu
03 Apr 2025 · LLMAG, KELM

Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, X. Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko
09 Mar 2025 · LRM

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai
15 Jan 2025

LoL-PIM: Long-Context LLM Decoding with Scalable DRAM-PIM System
Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, W. Lee, Minjae Lee, ..., Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi
28 Dec 2024

InstInfer: In-Storage Attention Offloading for Cost-Effective Long-Context LLM Inference
Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, Jie Zhang
08 Sep 2024

Teola: Towards End-to-End Optimization of LLM-based Applications
Xin Tan, Yimin Jiang, Yitao Yang, Hong-Yu Xu
29 Jun 2024

Adaptive Layer Splitting for Wireless LLM Inference in Edge Computing: A Model-Based Reinforcement Learning Approach
Yuxuan Chen, Rongpeng Li, Xiaoxue Yu, Zhifeng Zhao, Honggang Zhang
03 Jun 2024

Parrot: Efficient Serving of LLM-based Applications with Semantic Variable
Chaofan Lin, Zhenhua Han, Chengruidong Zhang, Yuqing Yang, Fan Yang, Chen Chen, Lili Qiu
30 May 2024

FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines
Jiaao He, Jidong Zhai
18 Mar 2024

NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
Cunxiang Wang, Ruoxi Ning, Boqi Pan, Tonghui Wu, Qipeng Guo, ..., Guangsheng Bao, Xiangkun Hu, Zheng Zhang, Qian Wang, Yue Zhang
18 Mar 2024 · RALM

LongHealth: A Question Answering Benchmark with Long Clinical Documents
Lisa Christine Adams, Felix Busch, T. Han, Jean-Baptiste Excoffier, Matthieu Ortala, Alexander Löser, Hugo J. W. L. Aerts, Jakob Nikolas Kather, Daniel Truhn, Keno Bressem
25 Jan 2024 · ELM, LM&MA, AI4MH

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang
13 Mar 2023

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro
17 Sep 2019 · MoE