ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

Efficiently Scaling Transformer Inference (arXiv:2211.05102)

9 November 2022
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, J. Dean

Papers citing "Efficiently Scaling Transformer Inference"

39 papers shown
EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices
Arnab Sanyal, Prithwish Mukherjee, Gourav Datta, Sandeep P. Chinchali · MQ · 05 May 2025

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Y. Chen, J. Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang · 05 May 2025

Cobra: Efficient Line Art COlorization with BRoAder References
Junhao Zhuang, Lingen Li, Xuan Ju, Zhaoyang Zhang, C. Yuan, Ying Shan · DiffM · 16 Apr 2025

MSCCL++: Rethinking GPU Communication Abstractions for Cutting-edge AI Applications
Aashaka Shah, Abhinav Jangda, B. Li, Caio Rocha, Changho Hwang, ..., Peng Cheng, Qinghua Zhou, Roshan Dathathri, Saeed Maleki, Ziyue Yang · GNN · 11 Apr 2025

Adaptive Layer-skipping in Pre-trained LLMs
Xuan Luo, Weizhi Wang, Xifeng Yan · 31 Mar 2025

Seesaw: High-throughput LLM Inference via Model Re-sharding
Qidong Su, Wei Zhao, X. Li, Muralidhar Andoorveedu, Chenhao Jiang, Zhanda Zhu, Kevin Song, Christina Giannoula, Gennady Pekhimenko · LRM · 09 Mar 2025

MALT Diffusion: Memory-Augmented Latent Transformers for Any-Length Video Generation
Sihyun Yu, Meera Hahn, Dan Kondratyuk, Jinwoo Shin, Agrim Gupta, José Lezama, Irfan Essa, David A. Ross, Jonathan Huang · DiffM, VGen · 18 Feb 2025

Boosting Multimodal Large Language Models with Visual Tokens Withdrawal for Rapid Inference
Zhihang Lin, Mingbao Lin, Luxi Lin, Rongrong Ji · 28 Jan 2025

HyGen: Efficient LLM Serving via Elastic Online-Offline Request Co-location
Ting Sun, Penghan Wang, Fan Lai · 15 Jan 2025

Unifying KV Cache Compression for Large Language Models with LeanKV
Yanqi Zhang, Yuwei Hu, Runyuan Zhao, John C. S. Lui, Haibo Chen · MQ · 04 Dec 2024

Context Parallelism for Scalable Million-Token Inference
Amy Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang · MoE, LRM · 04 Nov 2024

Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA
Sangmin Bae, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Seungyeon Kim, Tal Schuster · KELM · 28 Oct 2024

SWIFT: On-the-Fly Self-Speculative Decoding for LLM Inference Acceleration
Heming Xia, Yongqi Li, Jun Zhang, Cunxiao Du, Wenjie Li · LRM · 09 Oct 2024

ControlAR: Controllable Image Generation with Autoregressive Models
Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang · DiffM · 03 Oct 2024

Discrete Copula Diffusion
Anji Liu, Oliver Broadrick, Mathias Niepert, Guy Van den Broeck · DiffM · 02 Oct 2024

Confidential Prompting: Protecting User Prompts from Cloud LLM Providers
In Gim, Caihua Li, Lin Zhong · 27 Sep 2024

GraphEx: A Graph-based Extraction Method for Advertiser Keyphrase Recommendation
Ashirbad Mishra, Soumik Dey, Marshall Wu, Jinyu Zhao, He Yu, Kaichen Ni, Binbin Li, Kamesh Madduri · 05 Sep 2024

LazyLLM: Dynamic Token Pruning for Efficient Long Context LLM Inference
Qichen Fu, Minsik Cho, Thomas Merth, Sachin Mehta, Mohammad Rastegari, Mahyar Najibi · 19 Jul 2024

Let the Code LLM Edit Itself When You Edit the Code
Zhenyu He, Jun Zhang, Shengjie Luo, Jingjing Xu, Z. Zhang, Di He · KELM · 03 Jul 2024

OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Jikai Wang, Yi Su, Juntao Li, Qingrong Xia, Zi Ye, Xinyu Duan, Zhefeng Wang, Min Zhang · 25 Jun 2024

Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers
Rya Sanovar, Srikant Bharadwaj, Renée St. Amant, Victor Rühle, Saravan Rajmohan · 17 May 2024

SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
R. Prabhakar, R. Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, ..., Urmish Thakker, Dawei Huang, Sumti Jairath, Kevin J. Brown, K. Olukotun · MoE · 13 May 2024

Allo: A Programming Model for Composable Accelerator Design
Hongzheng Chen, Niansong Zhang, Shaojie Xiang, Zhichen Zeng, Mengjia Dai, Zhiru Zhang · 07 Apr 2024

Yi: Open Foundation Models by 01.AI
01.AI: Alex Young, Bei Chen, Chao Li, ..., Yue Wang, Yuxuan Cai, Zhenyu Gu, Zhiyuan Liu, Zonghong Dai · OSLM, LRM · 07 Mar 2024

Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, De-huai Chen, Tri Dao · 19 Jan 2024

Splitwise: Efficient generative LLM inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, Ricardo Bianchini · 30 Nov 2023

PrivateLoRA For Efficient Privacy Preserving LLM
Yiming Wang, Yu Lin, Xiaodong Zeng, Guannan Zhang · 23 Nov 2023

RecycleGPT: An Autoregressive Language Model with Recyclable Module
Yu Jiang, Qiaozhi He, Xiaomin Zhuang, Zhihua Wu, Kunpeng Wang, Wenlai Zhao, Guangwen Yang · KELM · 07 Aug 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, ..., Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, Thomas Scialom · AI4MH, ALM · 18 Jul 2023

PowerFusion: A Tensor Compiler with Explicit Data Movement Description and Instruction-level Graph IR
Zixuan Ma, Haojie Wang, Jingze Xing, Liyan Zheng, Chen Zhang, Huanqi Cao, Kezhao Huang, Shizhi Tang, Penghan Wang, Jidong Zhai · GNN · 11 Jul 2023

S³: Increasing GPU Utilization during Generative Inference for Higher Throughput
Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei · 09 Jun 2023

Dynamic Context Pruning for Efficient and Interpretable Autoregressive Transformers
Sotiris Anagnostidis, Dario Pavllo, Luca Biggio, Lorenzo Noci, Aurélien Lucchi, Thomas Hofmann · 25 May 2023

Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference Pipeline
Zangwei Zheng, Xiaozhe Ren, Fuzhao Xue, Yang Luo, Xin Jiang, Yang You · 22 May 2023

With Shared Microexponents, A Little Shifting Goes a Long Way
Bita Darvish Rouhani, Ritchie Zhao, V. Elango, Rasoul Shafipour, Mathew Hall, ..., Eric S. Chung, Zhaoxia Deng, S. Naghshineh, Jongsoo Park, Maxim Naumov · MQ · 16 Feb 2023

Scaling Vision Transformers to 22 Billion Parameters
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, ..., Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah Harmsen, N. Houlsby · MLLM · 10 Feb 2023

Beyond Distillation: Task-level Mixture-of-Experts for Efficient Inference
Sneha Kudugunta, Yanping Huang, Ankur Bapna, M. Krikun, Dmitry Lepikhin, Minh-Thang Luong, Orhan Firat · MoE · 24 Sep 2021

Efficient Content-Based Sparse Attention with Routing Transformers
Aurko Roy, M. Saffar, Ashish Vaswani, David Grangier · MoE · 12 Mar 2020

Scaling Laws for Neural Language Models
Jared Kaplan, Sam McCandlish, T. Henighan, Tom B. Brown, B. Chess, R. Child, Scott Gray, Alec Radford, Jeff Wu, Dario Amodei · 23 Jan 2020

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, Raul Puri, P. LeGresley, Jared Casper, Bryan Catanzaro · MoE · 17 Sep 2019