Fast Transformer Decoding: One Write-Head is All You Need
6 November 2019
Noam M. Shazeer
ArXiv (abs) · PDF · HTML
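The paper's core idea, multi-query attention (MQA), shares a single key/value head across all query heads, so the KV cache grows by one key and one value vector per decoded token instead of one per head. Below is a minimal NumPy sketch of one incremental decoding step; all names and shapes are illustrative choices, not the paper's code.

```python
import numpy as np

def multi_query_attention_step(x, W_q, W_k, W_v, W_o, k_cache, v_cache):
    """One incremental decoding step of multi-query attention (MQA).

    Shapes (illustrative):
      x:        (d_model,)            current token's hidden state
      W_q:      (h, d_model, d_head)  per-head query projections
      W_k, W_v: (d_model, d_head)     single shared key/value projections
      W_o:      (h, d_head, d_model)  per-head output projections
      k_cache, v_cache: lists of (d_head,) vectors from previous steps
    """
    h, d_model, d_head = W_q.shape

    q = np.einsum('d,hdk->hk', x, W_q)   # (h, d_head): one query per head
    k_cache.append(x @ W_k)              # one shared key per step, not h of them
    v_cache.append(x @ W_v)              # one shared value per step
    K = np.stack(k_cache)                # (t, d_head)
    V = np.stack(v_cache)                # (t, d_head)

    logits = q @ K.T / np.sqrt(d_head)   # (h, t): every head attends to the same K
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    ctx = weights @ V                    # (h, d_head)

    return np.einsum('hk,hkd->d', ctx, W_o)  # (d_model,) output
```

Relative to standard multi-head attention, the cache per step shrinks from h key/value vectors to one, which is the memory-bandwidth saving many of the KV-cache papers below build on.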
Papers citing "Fast Transformer Decoding: One Write-Head is All You Need" (50 of 421 shown)

Identifying and Evaluating Inactive Heads in Pretrained LLMs
Pedro Sandoval-Segura, Xijun Wang, Ashwinee Panda, Micah Goldblum, Ronen Basri, Tom Goldstein, David Jacobs. 04 Apr 2025.

SQuat: Subspace-orthogonal KV Cache Quantization
Hao Wang, Ligong Han, Kai Xu, Akash Srivastava. 31 Mar 2025.

Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
Design, Automation and Test in Europe (DATE), 2025
Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang. 30 Mar 2025.

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Minghao Lu, Yuqiang Chen. 25 Mar 2025.

BitDecoding: Unlocking Tensor Cores for Long-Context LLMs with Low-Bit KV Cache
Dayou Du, Shijie Cao, Jianyi Cheng, Ting Cao, Mao Yang. 24 Mar 2025.

Oaken: Fast and Efficient LLM Serving with Online-Offline Hybrid KV Cache Quantization
International Symposium on Computer Architecture (ISCA), 2025
Minsu Kim, Seongmin Hong, RyeoWook Ko, S. Choi, Hunjong Lee, Junsoo Kim, Joo-Young Kim, Jongse Park. 24 Mar 2025.

WindowKV: Task-Adaptive Group-Wise KV Cache Window Selection for Efficient LLM Inference
Youhui Zuo, Sibo Wei, C. Zhang, Zhuorui Liu, Dawei Song. 23 Mar 2025.

SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs
Shibo Jie, Yehui Tang, Kai Han, Zhi-Hong Deng, Jing Han. 20 Mar 2025.

ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism
Venmugil Elango. 20 Mar 2025.

GPU-Accelerated Motion Planning of an Underactuated Forestry Crane in Cluttered Environments
M. Vu, Gerald Ebmer, Alexander Watcher, Marc-Philip Ecker, Giang Nguyen, Tobias Glueck. 18 Mar 2025.

CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences
International Conference on Learning Representations (ICLR), 2025
Ziran Qin, Yuchen Cao, Mingbao Lin, Wen Hu, Shixuan Fan, Ke Cheng, Weiyao Lin, Jianguo Li. 16 Mar 2025.

A Review of DeepSeek Models' Key Innovative Techniques
Chengen Wang, Murat Kantarcioglu. 14 Mar 2025.

Key, Value, Compress: A Systematic Exploration of KV Cache Compression Techniques
IEEE Custom Integrated Circuits Conference (CICC), 2025
Neusha Javidnia, B. Rouhani, F. Koushanfar. 14 Mar 2025.

X-EcoMLA: Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression
Guihong Li, Mehdi Rezagholizadeh, Mingyu Yang, Vikram Appia, Emad Barsoum. 14 Mar 2025.

ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
Xin Liu, Xudong Wang, Pei Liu, Guoming Tang. 13 Mar 2025.

Cost-Optimal Grouped-Query Attention for Long-Context Modeling
Yuxiao Chen, Yutong Wu, Chenyang Song, Zhiyuan Liu, Maosong Sun, Xu Han. 12 Mar 2025.

Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
IEEE International Conference on Cloud Computing (CLOUD), 2025
Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral. 11 Mar 2025.

TokenButler: Token Importance is Predictable
Yash Akhauri, Ahmed F. AbouElhamayed, Yifei Gao, Chi-chih Chang, Nilesh Jain, Mohamed S. Abdelfattah. 10 Mar 2025.

Slim attention: cut your context memory in half without loss -- K-cache is all you need for MHA
Nils Graef, Matthew Clapp. 07 Mar 2025.

SAGE-Amine: Generative Amine Design with Multi-Property Optimization for Efficient CO2 Capture
Hocheol Lim, Hyein Cho, Jeonghoon Kim. 04 Mar 2025.

Rethinking Light Decoder-based Solvers for Vehicle Routing Problems
International Conference on Learning Representations (ICLR), 2025
Ziwei Huang, Jianan Zhou, Zhiguang Cao, Yixin Xu. 02 Mar 2025.

Tutorial Proposal: Speculative Decoding for Efficient LLM Inference
Heming Xia, Cunxiao Du, Yongqian Li, Qian Liu, Wenjie Li. 01 Mar 2025.

Reasoning is Periodicity? Improving Large Language Models Through Effective Periodicity Modeling
Yihong Dong, Ge Li, Xue Jiang, Yongding Tao, Kechi Zhang, ..., Huanyu Liu, Jiazheng Ding, Jia Li, Jinliang Deng, Hong Mei. 28 Feb 2025.

DBudgetKV: Dynamic Budget in KV Cache Compression for Ensuring Optimal Performance
Xuanfan Ni, Liyan Xu, Chenyang Lyu, Longyue Wang, Mo Yu, Lemao Liu, Fandong Meng, Jie Zhou, Piji Li. 24 Feb 2025.

KVCrush: Key value cache size-reduction using similarity in head-behaviour
Gopi Krishna Jha, Sameh Gobriel, Liubov Talamanova, Alexander Kozlov, Nilesh Jain. 24 Feb 2025.

Learning Humanoid Locomotion with World Model Reconstruction
Wandong Sun, L. Chen, Yongbo Su, Baoshi Cao, Yang Liu, Zongwu Xie. 22 Feb 2025.

CoKV: Optimizing KV Cache Allocation via Cooperative Game
Qiheng Sun, Hongwei Zhang, Haocheng Xia, Jiayao Zhang, Jinfei Liu, Kui Ren. 21 Feb 2025.

C2T: A Classifier-Based Tree Construction Method in Speculative Decoding
Feiye Huo, Jianchao Tan, Xunliang Cai, Shengli Sun. 20 Feb 2025.

PLDR-LLMs Learn A Generalizable Tensor Operator That Can Replace Its Own Deep Neural Net At Inference
Burc Gokden. 19 Feb 2025.

RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Li, Tiancheng Hu, Zhixia Liu, Xusheng Chen, Tao Xie, Yizhou Shan. 16 Feb 2025.

Hybrid Offline-online Scheduling Method for Large Language Model Inference Optimization
Bowen Pang, Kai Li, Ruifeng She, Feifan Wang. 14 Feb 2025.

Acceleration Multiple Heads Decoding for LLM via Dynamic Tree Attention
Zhendong Zhang. 09 Feb 2025.

Lossless Acceleration of Large Language Models with Hierarchical Drafting based on Temporal Locality in Speculative Decoding
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Sukmin Cho, S. Choi, T. Hwang, Jeongyeon Seo, Soyeong Jeong, Huije Lee, Hoyun Song, Jong C. Park, Youngjin Kwon. 08 Feb 2025.

Mass-Editing Memory with Attention in Transformers: A cross-lingual exploration of knowledge
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Daniel Tamayo, Aitor Gonzalez-Agirre, Javier Hernando, Marta Villegas. 04 Feb 2025.

Judge Decoding: Faster Speculative Sampling Requires Going Beyond Model Alignment
International Conference on Learning Representations (ICLR), 2025
Gregor Bachmann, Sotiris Anagnostidis, Albert Pumarola, Markos Georgopoulos, A. Sanakoyeu, Yuming Du, Edgar Schönfeld, Ali K. Thabet, Jonas Kohler. 31 Jan 2025.

Tensor Product Attention Is All You Need
Yifan Zhang, Yifeng Liu, Huizhuo Yuan, Zhen Qin, Yang Yuan, Q. Gu, Andrew Chi-Chih Yao. 11 Jan 2025.

Integrating LLMs with ITS: Recent Advances, Potentials, Challenges, and Future Directions
Doaa Mahmud, Hadeel Hajmohamed, Shamma Almentheri, Shamma Alqaydi, Lameya Aldhaheri, R. A. Khalil, Nasir Saeed. 08 Jan 2025.

Foundations of GenIR
Jiaxin Mao, Jingtao Zhan, Wenshu Fan. 06 Jan 2025.

DEHYDRATOR: Enhancing Provenance Graph Storage via Hierarchical Encoding and Sequence Generation
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
J. Ying, Tiantian Zhu, Mingqi Lv, Tieming Chen. 03 Jan 2025.

FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, ..., Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, Luis Ceze. 02 Jan 2025.

TokenRing: An Efficient Parallelism Framework for Infinite-Context LLMs via Bidirectional Communication
Zongwu Wang, Fangxin Liu, Mingshuai Li, Li Jiang. 29 Dec 2024.

A Silver Bullet or a Compromise for Full Attention? A Comprehensive Study of Gist Token-based Context Compression
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Xinting Huang, Dong Yu, Zhicheng Dou. 23 Dec 2024.

GenX: Mastering Code and Test Generation with Execution Feedback
Nan Wang, Yafei Liu, Chen Chen, H. Lu. 18 Dec 2024.

Deploying Foundation Model Powered Agent Services: A Survey
Wenchao Xu, Jinyu Chen, Peirong Zheng, Xiaoquan Yi, Tianyi Tian, ..., Quan Wan, Yining Qi, Yunfeng Fan, Qinliang Su, Xuemin Shen. 18 Dec 2024.

SnapGen: Taming High-Resolution Text-to-Image Models for Mobile Devices with Efficient Architectures and Training
Computer Vision and Pattern Recognition (CVPR), 2024
Dongting Hu, Jierun Chen, Xijie Huang, Huseyin Coskun, Arpit Sahni, ..., Mingming Gong, Sergey Tulyakov, Vidit Goel, Yanwu Xu, Jian Ren. 12 Dec 2024.

DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction
Symposium on Operating Systems Principles (SOSP), 2024
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C. S. Lui, Haibo Chen. 04 Dec 2024.

RandAR: Decoder-only Autoregressive Visual Generation in Random Orders
Computer Vision and Pattern Recognition (CVPR), 2024
Ziqi Pang, Tianyuan Zhang, Fujun Luan, Yunze Man, Hao Tan, Kai Zhang, William T. Freeman, Yu-Xiong Wang. 02 Dec 2024.

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs
Akhiad Bercovich, Tomer Ronen, Talor Abramovich, Nir Ailon, Nave Assaf, ..., Ido Shahaf, Oren Tropp, Omer Ullman Argov, Ran Zilberstein, Ran El-Yaniv. 28 Nov 2024.

Continuous Autoregressive Models with Noise Augmentation Avoid Error Accumulation
Marco Pasini, J. Nistal, Stefan Lattner, George Fazekas. 27 Nov 2024.

27 Nov 2024
MiniKV: Pushing the Limits of LLM Inference via 2-Bit Layer-Discriminative KV Cache
Akshat Sharma
Hangliang Ding
Jianping Li
Neel Dani
Minjia Zhang
373
2
0
27 Nov 2024