1911.02150
Cited By
Fast Transformer Decoding: One Write-Head is All You Need
6 November 2019
Noam M. Shazeer
HuggingFace (9 upvotes)
Papers citing "Fast Transformer Decoding: One Write-Head is All You Need"
50 / 421 papers shown
Title
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
Dongyang Fan
Diba Hashemi
Sai Praneeth Karimireddy
Martin Jaggi
65
0
0
26 Nov 2025
Accelerating Time Series Foundation Models with Speculative Decoding
Pranav Subbaraman
Fang Sun
Yue Yao
Huacong Tang
Xiao Luo
Yizhou Sun
AI4TS
148
0
0
22 Nov 2025
Global Cross-Time Attention Fusion for Enhanced Solar Flare Prediction from Multivariate Time Series
Onur Vural
S. M. Hamdi
S. F. Boubrahimi
AI4TS
64
0
0
17 Nov 2025
Lethe: Layer- and Time-Adaptive KV Cache Pruning for Reasoning-Intensive LLM Serving
Hui Zeng
Daming Zhao
Pengfei Yang
WenXuan Hou
Tianyang Zheng
Hui Li
Weiye Ji
Jidong Zhai
88
1
0
08 Nov 2025
Attention and Compression is all you need for Controllably Efficient Language Models
Jatin Prakash
A. Puli
Rajesh Ranganath
MQ
VLM
394
0
0
07 Nov 2025
From Prompts to Power: Measuring the Energy Footprint of LLM Inference
Francisco Caravaca
Ángel Cuevas
R. Cuevas
64
0
0
05 Nov 2025
SyMuPe: Affective and Controllable Symbolic Music Performance
Ilya Borovik
Dmitrii Gavrilev
Vladimir Viro
64
0
0
05 Nov 2025
Balancing Knowledge Updates: Toward Unified Modular Editing in LLMs
Jiahao Liu
Zijian Wang
Kuo Zhao
Dong Hu
KELM
104
0
0
31 Oct 2025
Knocking-Heads Attention
Zhanchao Zhou
Xiaodong Chen
Haoxing Chen
Zhenzhong Lan
Jianguo Li
72
0
0
27 Oct 2025
Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
Marko Karbevski
Antonij Mijoski
103
0
0
27 Oct 2025
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
Divya J. Bajpai
M. Hanawal
MLLM
VLM
178
0
0
26 Oct 2025
Stateful KV Cache Management for LLMs: Balancing Space, Time, Accuracy, and Positional Fidelity
Pratik Poudel
KELM
96
0
0
23 Oct 2025
Reasoning Language Model Inference Serving Unveiled: An Empirical Study
Qi Li
Junpan Wu
Xiang Liu
Yuxin Wang
Z. Li
Zhenheng Tang
Yuhan Chen
Shaohuai Shi
Xiaowen Chu
ReLM
LRM
192
1
0
21 Oct 2025
Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Zhoutong Wu
Y. Zhang
Yiming Dong
Chenheng Zhang
Cong Fang
Kun Yuan
Zhouchen Lin
95
0
0
19 Oct 2025
QSVD: Efficient Low-rank Approximation for Unified Query-Key-Value Weight Compression in Low-Precision Vision-Language Models
Yutong Wang
Haiyu Wang
Sai Qian Zhang
64
0
0
18 Oct 2025
End-to-End Multi-Modal Diffusion Mamba
Chunhao Lu
Qiang Lu
Meichen Dong
Jake Luo
90
3
0
15 Oct 2025
Deconstructing Attention: Investigating Design Principles for Effective Language Modeling
Huiyin Xue
Nafise Sadat Moosavi
Nikolaos Aletras
80
0
0
13 Oct 2025
DELTA: Dynamic Layer-Aware Token Attention for Efficient Long-Context Reasoning
Hossein Entezari Zarch
Lei Gao
Chaoyi Jiang
Murali Annavaram
LRM
41
0
0
10 Oct 2025
Hierarchical Scheduling for Multi-Vector Image Retrieval
Maoliang Li
K. Li
Yaoyang Liu
Jiayu Chen
Zihao Zheng
Yinjun Wu
Xiang Chen
68
0
0
10 Oct 2025
Artificial Hippocampus Networks for Efficient Long-Context Modeling
Yunhao Fang
Weihao Yu
Shu Zhong
Qinghao Ye
Xuehan Xiong
Lai Wei
68
1
0
08 Oct 2025
AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding
Shuqing Luo
Yilin Guan
Pingzhi Li
Hanrui Wang
Tianlong Chen
92
0
0
08 Oct 2025
The End of Transformers? On Challenging Attention and the Rise of Sub-Quadratic Architectures
Alexander Fichtl
Jeremias Bohn
Josefin Kelber
Edoardo Mosca
Georg Groh
72
0
0
06 Oct 2025
Compressed Convolutional Attention: Efficient Attention in a Compressed Latent Space
Tomás Figliolia
Nicholas Alonso
Rishi Iyer
Quentin Anthony
Beren Millidge
MQ
92
1
0
06 Oct 2025
Recover-LoRA: Data-Free Accuracy Recovery of Degraded Language Models via Low-Rank Adaptation
Devleena Das
Rajeev Patwari
Ashish Sirasao
69
0
0
06 Oct 2025
Poolformer: Recurrent Networks with Pooling for Long-Sequence Modeling
Daniel Gallo Fernández
76
0
0
02 Oct 2025
SoundReactor: Frame-level Online Video-to-Audio Generation
Koichi Saito
Julian Tanke
Christian Simon
Masato Ishii
Kazuki Shimada
Zachary Novack
Zhi-Wei Zhong
Akio Hayakawa
Takashi Shibuya
Yuki Mitsufuji
DiffM
VGen
206
0
0
02 Oct 2025
Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution
Alessio Devoto
Maximilian Jeblick
Simon Jégou
MQ
VLM
72
2
0
01 Oct 2025
SAGE-Music: Low-Latency Symbolic Music Generation via Attribute-Specialized Key-Value Head Sharing
Jiaye Tan
Haonan Luo
Linfeng Song
Shuaiqi Chen
Yishan Lyu
...
Haoran Zhang
Jiaming Bai
Haoran Cheng
Q. Vera Liao
Hao-Wen Dong
124
0
0
01 Oct 2025
ProxyAttn: Guided Sparse Attention via Representative Heads
Yixuan Wang
H. He
Siqi Bao
H. Wu
Haifeng Wang
Qingfu Zhu
Wanxiang Che
72
1
0
29 Sep 2025
Alternatives To Next Token Prediction In Text Generation - A Survey
Charlie Wyatt
Aditya Joshi
Flora D. Salim
52
0
0
29 Sep 2025
Multi-Item-Query Attention for Stable Sequential Recommendation
Mingshi Xu
Haoren Zhu
Wilfred Siu Hung Ng
40
0
0
29 Sep 2025
VeriLLM: A Lightweight Framework for Publicly Verifiable Decentralized Inference
Ke Wang
Zishuo Zhao
Xinyuan Song
Bill Shi
Libin Xia
Chris Tong
Lynn Ai
Felix Qu
Eric Yang
173
0
0
29 Sep 2025
Self-Speculative Biased Decoding for Faster Live Translation
Linxiao Zeng
Haoyun Deng
Kangyuan Shu
Shizhen Wang
48
0
0
26 Sep 2025
FastGRPO: Accelerating Policy Optimization via Concurrency-aware Speculative Decoding and Online Draft Learning
Yizhou Zhang
Ning Lv
T. Wang
Jisheng Dang
OffRL
LRM
86
1
0
26 Sep 2025
AMLA: MUL by ADD in FlashAttention Rescaling
Qichen Liao
Chengqiu Hu
Fangzheng Miao
Bao Li
Y. Liu
...
Lirui Jiang
Jun-Bo Wang
Lingchao Zheng
Jun Li
Yuwei Fan
68
0
0
24 Sep 2025
An overview of neural architectures for self-supervised audio representation learning from masked spectrograms
Sarthak Yadav
Sergios Theodoridis
Zheng-Hua Tan
Mamba
143
0
0
23 Sep 2025
UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression
Chenlong Deng
Zhisong Zhang
Kelong Mao
Shuaiyi Li
Tianqing Fang
H. Zhang
Haitao Mi
Dong Yu
Zhicheng Dou
106
0
0
19 Sep 2025
Crown, Frame, Reverse: Layer-Wise Scaling Variants for LLM Pre-Training
Andrei Baroian
Kasper Notebomer
84
0
0
08 Sep 2025
Efficient Item ID Generation for Large-Scale LLM-based Recommendation
Anushya Subbiah
Vikram Aggarwal
James Pine
Steffen Rendle
Krishna Sayana
Kun Su
54
0
0
03 Sep 2025
DSDE: Dynamic Speculative Decoding with KLD Stability for Real-World Serving
Mingyu Yang
Jae-Young Choi
Kihyo Moon
Minsung Jang
Eunjoo Jeon
140
0
0
01 Sep 2025
KVComp: A High-Performance, LLM-Aware, Lossy Compression Framework for KV Cache
Bo Jiang
Taolue Yang
Youyuan Liu
Chengming Zhang
Xubin He
Sian Jin
MQ
VLM
79
0
0
30 Aug 2025
ClusterFusion: Expanding Operator Fusion Scope for LLM Inference via Cluster-Level Collective Primitive
Xinhao Luo
Zihan Liu
Yangjie Zhou
Shihan Fang
Ziyu Huang
...
Chen Zhang
Shixuan Sun
Zhenzhe Zheng
Chen Chen
Minyi Guo
VLM
103
1
0
26 Aug 2025
CommonKV: Compressing KV Cache with Cross-layer Parameter Sharing
Yixuan Wang
Haoyu Qiao
Lujun Li
Qingfu Zhu
Wanxiang Che
MQ
100
0
0
22 Aug 2025
SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning
Remote Sensing (RS), 2025
Huanxuan Liao
Yixing Xu
Shizhu He
Guanchen Li
Xuanwu Yin
Dong Li
E. Barsoum
Jun Zhao
Kang Liu
115
1
0
21 Aug 2025
WISCA: A Lightweight Model Transition Method to Improve LLM Training via Weight Scaling
Jiacheng Li
Jianchao Tan
Zhidong Yang
Pingwei Sun
Feiye Huo
...
Xiangyu Zhang
Maoxin He
Guangming Tan
Weile Jia
Tong Zhao
72
3
0
21 Aug 2025
PENGUIN: Enhancing Transformer with Periodic-Nested Group Attention for Long-term Time Series Forecasting
Tian Sun
Yuqi Chen
Weiwei Sun
AI4TS
116
1
0
19 Aug 2025
FuXi-β: Towards a Lightweight and Fast Large-Scale Generative Recommendation Model
Yufei Ye
Wei Guo
Hao Wang
Hong Zhu
Yuyang Ye
Yong Liu
Huifeng Guo
Ruiming Tang
Defu Lian
Tong Xu
86
2
0
14 Aug 2025
READER: Retrieval-Assisted Drafter for Efficient LLM Inference
Maxim Divilkovskiy
Vitaly Malygin
Sergey Zlobin
Sultan Isali
Vasily Kalugin
Stanislav Ilyushin
Nuriza Aitassova
Yi Fei
Zeng Weidi
RALM
100
0
0
12 Aug 2025
Chi-Geometry: A Library for Benchmarking Chirality Prediction of GNNs
Rylie Weaver
Massimiliano Lupo Pasini
60
0
0
12 Aug 2025
Many-Turn Jailbreaking
Xianjun Yang
Liqiang Xiao
Shiyang Li
Faisal Ladhak
Hyokun Yun
Linda R. Petzold
Yi Xu
William Wang
91
0
0
09 Aug 2025