H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models (arXiv 2306.14048)
24 June 2023
Zhenyu (Allen) Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao-quan Song, Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, Beidi Chen · VLM

Papers citing "H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models" (36 / 186 papers shown)

ZipCache: Accurate and Efficient KV Cache Quantization with Salient Token Identification
Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, Bohan Zhuang · MQ · 35 / 24 / 0 · 23 May 2024

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention
William Brandon, Mayank Mishra, Aniruddha Nrusimha, Rameswar Panda, Jonathan Ragan-Kelley · MQ · 32 / 38 / 0 · 21 May 2024

Challenges in Deploying Long-Context Transformers: A Theoretical Peak Performance Analysis
Yao Fu · 22 / 19 / 0 · 14 May 2024

SKVQ: Sliding-window Key and Value Cache Quantization for Large Language Models
Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, Dahua Lin · MQ · 27 / 18 / 0 · 10 May 2024

KV Cache is 1 Bit Per Channel: Efficient Large Language Model Inference with Coupled Quantization
Tianyi Zhang, Jonah Yi, Zhaozhuo Xu, Anshumali Shrivastava · MQ · 29 / 25 / 0 · 07 May 2024

CORM: Cache Optimization with Recent Message for Large Language Model Inference
Jincheng Dai, Zhuowei Huang, Haiyun Jiang, Chen Chen, Deng Cai, Wei Bi, Shuming Shi · 19 / 3 / 0 · 24 Apr 2024

Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Tyler Griggs, Xiaoxuan Liu, Jiaxiang Yu, Doyoung Kim, Wei-Lin Chiang, Alvin Cheung, Ion Stoica · 37 / 14 / 0 · 22 Apr 2024

SnapKV: LLM Knows What You are Looking for Before Generation
Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr F. Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, Deming Chen · VLM · 75 / 148 / 0 · 22 Apr 2024

A Survey on Efficient Inference for Large Language Models
Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, ..., Shengen Yan, Guohao Dai, Xiao-Ping Zhang, Yuhan Dong, Yu-Xiang Wang · 46 / 78 / 0 · 22 Apr 2024

TriForce: Lossless Acceleration of Long Sequence Generation with Hierarchical Speculative Decoding
Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, Beidi Chen · 33 / 46 / 0 · 18 Apr 2024

LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism
Bingya Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, Xin Jin · RALM · 30 / 49 / 0 · 15 Apr 2024

LLoCO: Learning Long Contexts Offline
Sijun Tan, Xiuyu Li, Shishir G. Patil, Ziyang Wu, Tianjun Zhang, Kurt Keutzer, Joseph E. Gonzalez, Raluca A. Popa · RALM, OffRL, LLMAG · 35 / 6 / 0 · 11 Apr 2024

Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Thomas Merth, Qichen Fu, Mohammad Rastegari, Mahyar Najibi · LRM, RALM · 29 / 8 / 0 · 10 Apr 2024

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
Zihao Wang, Shaoduo Gan · 30 / 6 / 0 · 07 Apr 2024

BiSHop: Bi-Directional Cellular Learning for Tabular Data with Generalized Sparse Modern Hopfield Model
Chenwei Xu, Yu-Chao Huang, Jerry Yao-Chieh Hu, Weijian Li, Ammar Gilani, H. Goan, Han Liu · 37 / 19 / 0 · 04 Apr 2024

ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching
Youpeng Zhao, Di Wu, Jun Wang · 21 / 25 / 0 · 26 Mar 2024

Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey
Zeyu Han, Chao Gao, Jinyang Liu, Jeff Zhang, Sai Qian Zhang · 136 / 301 / 0 · 21 Mar 2024

Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference
Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, E. Ponti · 25 / 50 / 0 · 14 Mar 2024

LLM Inference Unveiled: Survey and Roofline Model Insights
Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, ..., Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, Kurt Keutzer · 37 / 77 / 0 · 26 Feb 2024

SubGen: Token Generation in Sublinear Time and Memory
A. Zandieh, Insu Han, Vahab Mirrokni, Amin Karbasi · 18 / 15 / 0 · 08 Feb 2024

InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory
Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Maosong Sun · LLMAG · 39 / 33 / 0 · 07 Feb 2024

Training and Serving System of Foundation Models: A Comprehensive Survey
Jiahang Zhou, Yanyu Chen, Zicong Hong, Wuhui Chen, Yue Yu, Tao Zhang, Hui Wang, Chuan-fu Zhang, Zibin Zheng · ALM · 17 / 5 / 0 · 05 Jan 2024

Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy · 26 / 34 / 0 · 28 Oct 2023

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Josh Alman, Zhao-quan Song · 21 / 31 / 0 · 06 Oct 2023

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, Jianfeng Gao · 4 / 213 / 0 · 03 Oct 2023

GradientCoin: A Peer-to-Peer Decentralized Large Language Models
Yeqi Gao, Zhao-quan Song, Junze Yin · 21 / 18 / 0 · 21 Aug 2023

Zero-th Order Algorithm for Softmax Attention Optimization
Yichuan Deng, Zhihang Li, Sridhar Mahadevan, Zhao-quan Song · 27 / 13 / 0 · 17 Jul 2023

Fast Quantum Algorithm for Attention Computation
Yeqi Gao, Zhao-quan Song, Xin Yang, Ruizhe Zhang · LRM · 23 / 19 / 0 · 16 Jul 2023

An Iterative Algorithm for Rescaled Hyperbolic Functions Regression
Yeqi Gao, Zhao-quan Song, Junze Yin · 23 / 33 / 0 · 01 May 2023

Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
Jingfeng Yang, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, Bing Yin, Xia Hu · LM&MA · 123 / 593 / 0 · 26 Apr 2023

Do Transformers Parse while Predicting the Masked Word?
Haoyu Zhao, A. Panigrahi, Rong Ge, Sanjeev Arora · 74 / 29 / 0 · 14 Mar 2023

FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, ..., Joseph E. Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, Ce Zhang · 144 / 365 / 0 · 13 Mar 2023

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou · LM&Ro, LRM, AI4CE, ReLM · 315 / 8,261 / 0 · 28 Jan 2022

Sequence Length is a Domain: Length-based Overfitting in Transformer Models
Dusan Varis, Ondrej Bojar · 45 / 56 / 0 · 15 Sep 2021

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks
Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste · MQ · 131 / 679 / 0 · 31 Jan 2021

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman · ELM · 294 / 6,927 / 0 · 20 Apr 2018