Fast Transformer Decoding: One Write-Head is All You Need
Noam M. Shazeer
6 November 2019
arXiv (abs) · PDF · HTML · HuggingFace (9 upvotes)
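
For context, the paper's core proposal is multi-query attention: all query heads share a single key head and a single value head, which shrinks the key/value cache that incremental decoding must re-read at every generated token. The NumPy sketch below is a minimal illustration of that sharing, not the paper's implementation; the head count and dimensions are arbitrary choices, and causal masking is omitted for brevity.

```python
# Minimal sketch of multi-query attention (MQA): every query head attends
# over ONE shared key head and ONE shared value head, so the K/V cache is
# h times smaller than in standard multi-head attention.
# Illustrative only: dimensions are arbitrary and causal masking is omitted.
import numpy as np

def multi_query_attention(q, k, v):
    """q: (h, n, d) per-head queries; k, v: (n, d) shared across all heads."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                    # (h, n, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                               # (h, n, d)

rng = np.random.default_rng(0)
h, n, d = 8, 4, 16                                   # heads, positions, head dim
q = rng.standard_normal((h, n, d))
k = rng.standard_normal((n, d))                      # single shared key head
v = rng.standard_normal((n, d))                      # single shared value head
print(multi_query_attention(q, k, v).shape)          # (8, 4, 16)
```

In standard multi-head attention, k and v would each carry a per-head axis of size h; sharing them across query heads is what cuts the memory traffic of reading the K/V cache during decoding by roughly a factor of h.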

Papers citing "Fast Transformer Decoding: One Write-Head is All You Need"

Showing 50 of 428 citing papers.
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu
445 · 23 · 0 · 16 Aug 2024

Beyond Uniform Query Distribution: Key-Driven Grouped Query Attention
Zohaib Khan, Muhammad Khaquan, Omer Tafveez, Burhanuddin Samiwala, Agha Ali Raza
205 · 3 · 0 · 15 Aug 2024

KOALA: Enhancing Speculative Decoding for LLM via Multi-Layer Draft Heads with Adversarial Learning
International Conference on Computer Supported Cooperative Work in Design (CSCWD), 2024
Kaiqi Zhang, Jing Zhao, Rui Chen
307 · 5 · 0 · 15 Aug 2024

Kraken: Inherently Parallel Transformers For Efficient Multi-Device Inference
Neural Information Processing Systems (NeurIPS), 2024
R. Prabhakar, Hengrui Zhang, D. Wentzlaff
290 · 1 · 0 · 14 Aug 2024

End-to-end Semantic-centric Video-based Multimodal Affective Computing [VGen]
Ronghao Lin, Ying Zeng, Sijie Mai, Haifeng Hu
282 · 2 · 0 · 14 Aug 2024

Post-Training Sparse Attention with Double Sparsity
Shuo Yang, Ying Sheng, Joseph E. Gonzalez, Ion Stoica, Lianmin Zheng
285 · 25 · 0 · 11 Aug 2024

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Utkarsh Saxena, Gobinda Saha, Sakshi Choudhary, Kaushik Roy
246 · 33 · 0 · 10 Aug 2024

NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu
250 · 31 · 0 · 07 Aug 2024

Cross-layer Attention Sharing for Pre-trained Large Language Models
Yongyu Mu, Yuzhang Wu, Yuchun Fan, Chenglong Wang, Hengyu Li, ..., Murun Yang, Fandong Meng, Jie Zhou, Tong Xiao, Jingbo Zhu
262 · 6 · 0 · 04 Aug 2024

JambaTalk: Speech-Driven 3D Talking Head Generation Based on Hybrid Transformer-Mamba Model [Mamba]
Farzaneh Jafari, Stefano Berretti, Anup Basu
449 · 2 · 0 · 03 Aug 2024

What comes after transformers? -- A selective survey connecting ideas in deep learning [AI4CE]
Johannes Schneider
407 · 3 · 0 · 01 Aug 2024

Efficient Training of Large Language Models on Distributed Infrastructures: A Survey
Jiangfei Duan, Shuo Zhang, Zerui Wang, Lijuan Jiang, Wenwen Qu, ..., Dahua Lin, Yonggang Wen, Xin Jin, Tianwei Zhang, Yang Liu
363 · 31 · 0 · 29 Jul 2024
Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
Shi Luohe, Hongyi Zhang, Yao Yao, Z. Li, Zhao Hai
533 · 92 · 0 · 25 Jul 2024
MINI-SEQUENCE TRANSFORMER: Optimizing Intermediate Memory for Long Sequences Training
Cheng Luo, Jiawei Zhao, Zhuoming Chen, Beidi Chen, A. Anandkumar
264 · 5 · 0 · 22 Jul 2024

RazorAttention: Efficient KV Cache Compression Through Retrieval Heads [MQ]
Hanlin Tang, Yang Lin, Aiyue Chen, Qingsen Han, Shikuan Hong, Jing Lin, Gongyi Wang
234 · 57 · 0 · 22 Jul 2024

PipeInfer: Accelerating LLM Inference using Asynchronous Pipelined Speculation [LRM]
Branden Butler, Sixing Yu, Arya Mazaheri, Ali Jannesari
226 · 19 · 0 · 16 Jul 2024

Weighted Grouped Query Attention in Transformers
Sai Sena Chinnakonduru, Astarag Mohapatra
186 · 6 · 0 · 15 Jul 2024

Investigating Low-Rank Training in Transformer Language Models: Efficiency and Scaling Analysis
Xiuying Wei, Skander Moalla, Razvan Pascanu, Çağlar Gülçehre
231 · 5 · 0 · 13 Jul 2024

Beyond KV Caching: Shared Attention for Efficient LLMs
Bingli Liao, Danilo Vasconcellos Vargas
210 · 9 · 0 · 13 Jul 2024

Inference Optimization of Foundation Models on AI Accelerators
Youngsuk Park, Kailash Budhathoki, Liangfu Chen, Jonas M. Kübler, Jiaji Huang, Matthäus Kleindessner, Jun Huan, Volkan Cevher, Yida Wang, George Karypis
313 · 14 · 0 · 12 Jul 2024

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao
505 · 321 · 0 · 11 Jul 2024

Mixture-of-Modules: Reinventing Transformers as Dynamic Assemblies of Modules [MoE]
Zhuocheng Gong, Ang Lv, Jian Guan, Junxi Yan, Wei Wu, Huishuai Zhang, Minlie Huang, Dongyan Zhao, Rui Yan
174 · 8 · 0 · 09 Jul 2024

Narrow Transformer: Starcoder-Based Java-LM For Desktop
Kamalkumar Rathinasamy, Balaji A J, Ankush Kumar, Gagan Gayari, Harshini K, Rajab Ali Mondal, S. SreenivasaRaghavanK, Swayam Singh
174 · 1 · 0 · 04 Jul 2024

The Mysterious Case of Neuron 1512: Injectable Realignment Architectures Reveal Internal Characteristics of Meta's Llama 2 Model
Brenden Smith, Dallin Baker, Clayton Chase, Myles Barney, Kaden Parker, Makenna Allred, Peter Hu, Alex Evans, Nancy Fulda
209 · 0 · 0 · 04 Jul 2024

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention
Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, ..., Amir H. Abdi, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, L. Qiu
328 · 225 · 0 · 02 Jul 2024
KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches
Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, ..., Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, Zirui Liu, Xia Hu
305 · 41 · 0 · 01 Jul 2024
WallFacer: Guiding Transformer Model Training Out of the Long-Context Dark Forest with N-body Problem
Ziming Liu, Shaoyu Wang, Shenggan Cheng, Zhongkai Zhao, Xuanlei Zhao, James Demmel, Yang You
211 · 1 · 0 · 30 Jun 2024

From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck, Amanda Bertsch, Matthew Finlayson, Hailey Schoelkopf, Alex Xie, Graham Neubig, Ilia Kulikov, Zaid Harchaoui
374 · 110 · 0 · 24 Jun 2024

Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Euiin Yi, Taehyeon Kim, Hongseok Jeung, Du-Seong Chang, Se-Young Yun
175 · 7 · 0 · 24 Jun 2024

Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
Xiuying Wei, Skander Moalla, Razvan Pascanu, Çağlar Gülçehre
338 · 4 · 0 · 24 Jun 2024

A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
Florin Cuconasu, Giovanni Trappolini, Nicola Tonellotto, Fabrizio Silvestri
206 · 4 · 0 · 21 Jun 2024
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools [ALM]
Team GLM: Aohan Zeng, Bin Xu, Bowen Wang, ..., Zhaoyu Wang, Zhen Yang, Zhengxiao Du, Zhenyu Hou, Zihan Wang
371 · 1,167 · 0 · 18 Jun 2024
MCSD: An Efficient Language Model with Diverse Fusion
Hua Yang, Duohai Li, Shiman Li
205 · 2 · 0 · 18 Jun 2024
D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models
Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, ..., Xin Wang, Siqi Luo, Jing Xiong, Mi Zhang
392 · 5 · 0 · 18 Jun 2024
Autoregressive Image Generation without Vector Quantization [DiffM]
Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, Kaiming He
481 · 478 · 0 · 17 Jun 2024

Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers
Qian Chen, Wen Wang, Qinglin Zhang, Siqi Zheng, Shiliang Zhang, Chong Deng, Hai Yu, Jiaqing Liu, Yukun Ma, Chong Zhang
146 · 4 · 0 · 17 Jun 2024

Optimized Speculative Sampling for GPU Hardware Accelerators
Dominik Wagner, Seanie Lee, Ilja Baumann, Philipp Seeberger, Korbinian Riedhammer, Tobias Bocklet
209 · 4 · 0 · 16 Jun 2024

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding
Zayd Muhammad Kawakibi Zuhri, Muhammad Farid Adilazuarda, Ayu Purwarianti, Alham Fikri Aji
252 · 16 · 0 · 13 Jun 2024

Investigating the translation capabilities of Large Language Models trained on parallel data only [LRM]
Javier García Gilabert, Carlos Escolano, Aleix Sant Savall, Francesca de Luca Fornaciari, Audrey Mash, Xixian Liao, Maite Melero
320 · 2 · 0 · 13 Jun 2024

OPTune: Efficient Online Preference Tuning
Lichang Chen, Jiuhai Chen, Chenxi Liu, John Kirchenbauer, Davit Soselia, Chen Zhu, Tom Goldstein, Wanrong Zhu, Heng Huang
130 · 7 · 0 · 11 Jun 2024

QuickLLaMA: Query-aware Inference Acceleration for Large Language Models [LRM]
Jingyao Li, Han Shi, Xin Jiang, Zhenguo Li, Hong Xu, Jiaya Jia
187 · 4 · 0 · 11 Jun 2024
Effectively Compress KV Heads for LLM [MQ, VLM]
Hao Yu, Zelan Yang, Shen Li, Jianxin Wu
166 · 27 · 0 · 11 Jun 2024
Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling [Mamba]
Liliang Ren, Yang Liu, Yadong Lu, Haoran Pan, Chen Liang, Weizhu Chen
368 · 111 · 0 · 11 Jun 2024

QCQA: Quality and Capacity-aware grouped Query Attention
Vinay Joshi, Prashant Laddha, Shambhavi Sinha, O. J. Omer, S. Subramoney
304 · 5 · 0 · 08 Jun 2024

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead [MQ]
A. Zandieh, Majid Daliri, Insu Han
219 · 17 · 0 · 05 Jun 2024

Block Transformer: Global-to-Local Language Modeling for Fast Inference
Namgyu Ho, Sangmin Bae, Taehyeon Kim, Hyunjik Jo, Yireun Kim, Tal Schuster, Adam Fisch, James Thorne, Se-Young Yun
306 · 27 · 0 · 04 Jun 2024

Universal In-Context Approximation By Prompting Fully Recurrent Models [LRM]
Aleksandar Petrov, Tom A. Lamb, Alasdair Paren, Juil Sock, Adel Bibi
180 · 0 · 0 · 03 Jun 2024

DHA: Learning Decoupled-Head Attention from Transformer Checkpoints via Adaptive Heads Fusion
Yilong Chen, Linhao Zhang, Junyuan Shang, Ying Tai, Tingwen Liu, Shuohuan Wang, Yu Sun
172 · 7 · 0 · 03 Jun 2024

An Early Investigation into the Utility of Multimodal Large Language Models in Medical Imaging
Sulaiman Khan, Md. Rafiul Biswas, Alina Murad, Hazrat Ali, Zubair Shah
170 · 6 · 0 · 02 Jun 2024

MAP-Neo: Highly Capable and Transparent Bilingual Large Language Model Series [ELM]
Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, ..., Zi-Kai Zhao, Jiajun Zhang, Wanli Ouyang, Wenhao Huang, Lei Ma
310 · 71 · 0 · 29 May 2024