Transformers are Multi-State RNNs
Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz
11 January 2024 · arXiv 2401.06104 · Community: OffRL
Links: arXiv (abs) · PDF · HTML · HuggingFace (39 upvotes) · GitHub (126★)
Papers citing "Transformers are Multi-State RNNs" (28 of 28 papers shown)
1. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution (01 Oct 2025)
   Alessio Devoto, Maximilian Jeblick, Simon Jégou · Tags: MQ, VLM · Metrics: 0 / 0 / 0

2. KVCompose: Efficient Structured KV Cache Compression with Composite Tokens (05 Sep 2025)
   Dmitry Akulov, Mohamed Sana, A. De Domenico, Tareq Si Salem, Nicola Piovesan, Fadhel Ayed · Tags: MQ · Metrics: 36 / 0 / 0

3. Spotlight Attention: Towards Efficient LLM Generation via Non-linear Hashing-based KV Cache Retrieval (27 Aug 2025)
   Wenhao Li, Yuxin Zhang, Gen Luo, Haiyuan Wan, Ziyang Gong, Fei Chao, Rongrong Ji · Metrics: 0 / 0 / 0

4. TPLA: Tensor Parallel Latent Attention for Efficient Disaggregated Prefill and Decode Inference (21 Aug 2025)
   Xiaojuan Tang, Fanxu Meng, Pingzhi Tang, Yuxuan Wang, Di Yin, Xing Sun, M. Zhang · Metrics: 28 / 0 / 0

5. SmallKV: Small Model Assisted Compensation of KV Cache Compression for Efficient LLM Inference (03 Aug 2025)
   Yi Zhao, Yajuan Peng, Cam-Tu Nguyen, Zuchao Li, Xiaoliang Wang, Hai Zhao, Xiaoming Fu · Metrics: 57 / 0 / 0

6. HCAttention: Extreme KV Cache Compression via Heterogeneous Attention Computing for LLMs (26 Jul 2025)
   Dongquan Yang, Yifan Yang, Xiaotian Yu, Xianbiao Qi, Rong Xiao · Tags: MQ · Metrics: 64 / 0 / 0

7. LazyEviction: Lagged KV Eviction with Attention Pattern Observation for Efficient Long Reasoning (19 Jun 2025)
   Haoyue Zhang, Hualei Zhang, Xiaosong Ma, Jie Zhang, Song Guo · Tags: LRM · Metrics: 79 / 1 / 0

8. Homogeneous Keys, Heterogeneous Values: Exploiting Local KV Cache Asymmetry for Long-Context LLMs (04 Jun 2025)
   Wanyun Cui, Mingwei Xu · Metrics: 81 / 0 / 0

9. Compress, Gather, and Recompute: REFORMing Long-Context Processing in Transformers (01 Jun 2025)
   Woomin Song, Sai Muralidhar Jayanthi, S. Ronanki, Kanthashree Mysore Sathyendra, Jinwoo Shin, Aram Galstyan, Shubham Katiyar, S. Bodapati · Tags: VLM · Metrics: 107 / 0 / 0

10. KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction (29 May 2025)
    Jang-Hyun Kim, Jinuk Kim, S. Kwon, Jae W. Lee, Sangdoo Yun, Hyun Oh Song · Tags: MQ, VLM · Metrics: 142 / 4 / 0

11. KeyDiff: Key Similarity-Based KV Cache Eviction for Long-Context LLM Inference in Resource-Constrained Environments (21 Apr 2025)
    Junyoung Park, Dalton Jones, Matthew J Morse, Raghavv Goel, Mingu Lee, Chris Lott · Metrics: 189 / 6 / 0
12. Adaptive Computation Pruning for the Forgetting Transformer (09 Apr 2025)
    Zhixuan Lin, J. Obando-Ceron, Xu Owen He, Aaron Courville · Metrics: 163 / 2 / 0
13. SentenceKV: Efficient LLM Inference via Sentence-Level Semantic KV Caching (01 Apr 2025)
    Yuxuan Zhu, Ali Falahati, David H. Yang, Mohammad Mohammadi Amiri · Metrics: 151 / 0 / 0

14. CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences (16 Mar 2025)
    Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li · Metrics: 165 / 12 / 0

15. KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse (21 Feb 2025)
    Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang · Tags: VLM · Metrics: 313 / 9 / 0

16. Graph-Aware Isomorphic Attention for Adaptive Dynamics in Transformers (04 Jan 2025)
    Markus J. Buehler · Tags: AI4CE · Metrics: 217 / 5 / 0
17. Squeezed Attention: Accelerating Long Context Length LLM Inference (14 Nov 2024)
    Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Monishwaran Maheswaran, June Paik, Michael W. Mahoney, Kurt Keutzer, Amir Gholami · Metrics: 274 / 23 / 0
18. An Evolved Universal Transformer Memory (17 Oct 2024)
    Edoardo Cetin, Qi Sun, Tianyu Zhao, Yujin Tang · Metrics: 744 / 2 / 0

19. FiRST: Finetuning Router-Selective Transformers for Input-Adaptive Latency Reduction (16 Oct 2024)
    Akriti Jain, Saransh Sharma, Koyel Mukherjee, Soumyabrata Pal · Metrics: 129 / 1 / 0

20. In-context KV-Cache Eviction for LLMs via Attention-Gate (15 Oct 2024)
    Zihao Zeng, Bokai Lin, Tianqi Hou, Hao Zhang, Zhijie Deng · Metrics: 185 / 3 / 0

21. Towards LifeSpan Cognitive Systems (20 Sep 2024)
    Yu Wang, Chi Han, Tongtong Wu, Xiaoxin He, Wangchunshu Zhou, ..., Zexue He, Wei Wang, Gholamreza Haffari, Heng Ji, Julian McAuley · Tags: KELM, CLL · Metrics: 692 / 5 / 0

22. DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion (03 Jun 2024)
    Yilong Chen, Linhao Zhang, Junyuan Shang, Ying Tai, Tingwen Liu, Shuohuan Wang, Yu Sun · Metrics: 91 / 3 / 0

23. Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks (24 May 2024)
    Jerome Sieber, Carmen Amo Alonso, A. Didier, Melanie Zeilinger, Antonio Orvieto · Tags: AAML · Metrics: 182 / 15 / 0

24. CORM: Cache Optimization with Recent Message for Large Language Model Inference (24 Apr 2024)
    Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi · Metrics: 137 / 5 / 0
25. Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference (14 Mar 2024)
    Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo Ponti · Metrics: 162 / 76 / 0
26. Yi: Open Foundation Models by 01.AI (07 Mar 2024)
    01.AI: Alex Young, Bei Chen, Chao Li, ..., Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai · Tags: OSLM, LRM · Metrics: 442 / 665 / 0
27. Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding (05 Mar 2024)
    Zhenyu Zhang, Runjin Chen, Shiwei Liu, Zhewei Yao, Olatunji Ruwase, Beidi Chen, Xiaoxia Wu, Zhangyang Wang · Metrics: 143 / 51 / 0
28. Gated Linear Attention Transformers with Hardware-Efficient Training (11 Dec 2023)
    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim · Metrics: 212 / 240 / 0