arXiv:2402.14808
RelayAttention for Efficient Large Language Model Serving with Long System Prompts
Lei Zhu, Xinjiang Wang, Wayne Zhang, Rynson W. H. Lau
22 February 2024
Papers citing "RelayAttention for Efficient Large Language Model Serving with Long System Prompts" (4 papers):
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
Yaxiong Wu, Sheng Liang, Chen Zhang, Y. Wang, Y. Zhang, Huifeng Guo, Ruiming Tang, Y. Liu
22 Apr 2025 [KELM]
FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze
02 Jan 2025
DeFT: Decoding with Flash Tree-attention for Efficient Tree-structured LLM Inference
Jinwei Yao, Kaiqi Chen, Kexun Zhang, Jiaxuan You, Binhang Yuan, Zeke Wang, Tao Lin
30 Mar 2024
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism
DeepSeek-AI: Xiao Bi, Deli Chen, Guanting Chen, ..., Yao Zhao, Shangyan Zhou, Shunfeng Zhou, Qihao Zhu, Yuheng Zou
05 Jan 2024 [LRM, ALM]