Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
6 November 2019
arXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
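
For context, the paper's core proposal is multi-query attention: all query heads share a single key head and a single value head, which shrinks the key/value cache that incremental decoding must re-read at every generated token. The NumPy sketch below is a minimal illustration of that sharing, not the paper's implementation; the head count and dimensions are arbitrary choices, and causal masking is omitted for brevity.

```python
# Minimal sketch of multi-query attention (MQA): every query head attends
# over ONE shared key head and ONE shared value head, so the K/V cache is
# h times smaller than in standard multi-head attention.
# Illustrative only: dimensions are arbitrary and causal masking is omitted.
import numpy as np

def multi_query_attention(q, k, v):
    """q: (h, n, d) per-head queries; k, v: (n, d) shared across all heads."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # (h, n, d)

rng = np.random.default_rng(0)
h, n, d = 8, 4, 16                                   # heads, positions, head dim
q = rng.standard_normal((h, n, d))
k = rng.standard_normal((n, d))                      # single shared key head
v = rng.standard_normal((n, d))                      # single shared value head
print(multi_query_attention(q, k, v).shape)          # (8, 4, 16)
```

In standard multi-head attention, k and v would each carry a per-head axis of size h; sharing them across query heads is what cuts the memory traffic of reading the K/V cache during decoding by roughly a factor of h.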

Papers citing "Fast Transformer Decoding: One Write-Head is All You Need"

Showing 50 of 428 citing papers.
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu
445 · 23 · 0 · 16 Aug 2024

Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza
205 · 3 · 0 · 15 Aug 2024

KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2024
Kaiqi Zhang, Jing Zhao, Rui Chen
307 · 5 · 0 · 15 Aug 2024

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
Neural Information Processing Systems (NeurIPS), 2024
R. Prabhakar, Hengrui Zhang, D. Wentzlaff
290 · 1 · 0 · 14 Aug 2024

End-to-end Semantic-centric Video-based Multimodal Affective Computing [VGen]
Ronghao Lin, Ying Zeng, Sijie Mai, Haifeng Hu
282 · 2 · 0 · 14 Aug 2024

Post-Training Sparse Attention with Double Sparsity
Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
285 · 25 · 0 · 11 Aug 2024

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy
246 · 33 · 0 · 10 Aug 2024

NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
250 · 31 · 0 · 07 Aug 2024

Cross-layer Attention Sharing for Pre-trained Large Language Models
Yongyu Mu, Yuzhang Wu, Yuchun Fan, Chenglong Wang, Hengyu Li, ..., Murun Yang, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
262 · 6 · 0 · 04 Aug 2024

JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model [Mamba]
Farzaneh Jafari, Stefano Berretti, Anup Basu
449 · 2 · 0 · 03 Aug 2024

What comes after transformers? -- A selective survey connecting ideas in deep learning [AI4CE]
Johannes Schneider
407 · 3 · 0 · 01 Aug 2024

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Yang Liu
363 · 31 · 0 · 29 Jul 2024
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Shi Luohe, Hongyi Zhang, Yao Yao, Z. Li, Zhao Hai
533 · 92 · 0 · 25 Jul 2024
MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, A. Anandkumar
264 · 5 · 0 · 22 Jul 2024

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads [MQ]
Hanlin Tang, Yang Lin, Aiyue Chen, Qingsen Han, Shikuan Hong, Jing Lin, Gongyi Wang
234 · 57 · 0 · 22 Jul 2024

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [LRM]
Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari
226 · 19 · 0 · 16 Jul 2024

Weighted Grouped Query Attention in Transformers
Sai Sena Chinnakonduru, Astarag Mohapatra
186 · 6 · 0 · 15 Jul 2024

Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
Xiuying Wei, Skander Moalla, Razvan Pascanu, Çağlar Gülçehre
231 · 5 · 0 · 13 Jul 2024

Beyond KV Caching: Shared Attention for Efficient LLMs
Bingli Liao, Danilo Vasconcellos Vargas
210 · 9 · 0 · 13 Jul 2024

Inference Optimization of Foundation Models on AI Accelerators
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas M. Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis
313 · 14 · 0 · 12 Jul 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
505 · 321 · 0 · 11 Jul 2024

Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [MoE]
Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan
174 · 8 · 0 · 09 Jul 2024

Narrow Transformer: Starcoder-Based Java-LM For Desktop
Kamalkumar Rathinasamy, Balaji A J, Ankush Kumar, Gagan Gayari, Harshini K, Rajab Ali Mondal, S. SreenivasaRaghavanK, Swayam Singh
174 · 1 · 0 · 04 Jul 2024

The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model
Brenden Smith, Dallin Baker, Clayton Chase, Myles Barney, Kaden Parker, Makenna Allred, Peter Hu, Alex Evans, Nancy Fulda
209 · 0 · 0 · 04 Jul 2024

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, L. Qiu
328 · 225 · 0 · 02 Jul 2024
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, ..., Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
305 · 41 · 0 · 01 Jul 2024
WallFacer: Guiding Transformer Model Training Out of the Long-Context Dark Forest with N-body Problem
Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Xuanlei Zhao, James Demmel, Yang You
211 · 1 · 0 · 30 Jun 2024

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
374 · 110 · 0 · 24 Jun 2024

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun
175 · 7 · 0 · 24 Jun 2024

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
Xiuying Wei, Skander Moalla, Razvan Pascanu, Çağlar Gülçehre
338 · 4 · 0 · 24 Jun 2024

A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
206 · 4 · 0 · 21 Jun 2024
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools [ALM]
Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, ..., Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang
371 · 1,167 · 0 · 18 Jun 2024
MCSD: An Efficient Language Model with Diverse Fusion
Hua Yang, Duohai Li, Shiman Li
205 · 2 · 0 · 18 Jun 2024
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, ..., Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang
392 · 5 · 0 · 18 Jun 2024
Autoregressive Image Generation without Vector Quantization [DiffM]
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He
481 · 478 · 0 · 17 Jun 2024

Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers
Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang
146 · 4 · 0 · 17 Jun 2024

Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet
209 · 4 · 0 · 16 Jun 2024

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
252 · 16 · 0 · 13 Jun 2024

Investigating the translation capabilities of Large Language Models trained on parallel data only [LRM]
Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca de Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero
320 · 2 · 0 · 13 Jun 2024

OPTune: Efficient Online Preference Tuning
Lichang Chen, Jiuhai Chen, Chenxi Liu, John Kirchenbauer, Davit Soselia, Chen Zhu, Tom Goldstein, Wanrong Zhu, Heng Huang
130 · 7 · 0 · 11 Jun 2024

QuickLLaMA: Query-aware Inference Acceleration for Large Language Models [LRM]
Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia
187 · 4 · 0 · 11 Jun 2024
Effectively Compress KV Heads for LLM [MQ, VLM]
Hao Yu, Zelan Yang, Shen Li, Jianxin Wu
166 · 27 · 0 · 11 Jun 2024
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling [Mamba]
Liliang Ren, Yang Liu, Yadong Lu, Haoran Pan, Chen Liang, Weizhu Chen
368 · 111 · 0 · 11 Jun 2024

QCQA: Quality and Capacity-aware grouped Query Attention
Vinay Joshi, Prashant Laddha, Shambhavi Sinha, O. J. Omer, S. Subramoney
304 · 5 · 0 · 08 Jun 2024

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [MQ]
A. Zandieh, Majid Daliri, Insu Han
219 · 17 · 0 · 05 Jun 2024

Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
306 · 27 · 0 · 04 Jun 2024

Universal In-Context Approximation By Prompting Fully Recurrent Models [LRM]
Aleksandar Petrov, Tom A. Lamb, Alasdair Paren, Juil Sock, Adel Bibi
180 · 0 · 0 · 03 Jun 2024

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
Yilong Chen, Linhao Zhang, Junyuan Shang, Ying Tai, Tingwen Liu, Shuohuan Wang, Yu Sun
172 · 7 · 0 · 03 Jun 2024

An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging
Sulaiman Khan, Md. Rafiul Biswas, Alina Murad, Hazrat Ali, Zubair Shah
170 · 6 · 0 · 02 Jun 2024

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series [ELM]
Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, ..., Zi-Kai Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Lei Ma
310 · 71 · 0 · 29 May 2024