SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

arXiv:2211.10438 · 18 November 2022
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, Song Han
MQ

Papers citing "SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models"

50 / 526 papers shown

ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks
Xiaoxia Wu, Haojun Xia, Stephen Youn, Zhen Zheng, Shiyang Chen, ..., Reza Yazdani Aminabadi, Yuxiong He, Olatunji Ruwase, Leon Song, Zhewei Yao
14 Dec 2023

CBQ: Cross-Block Quantization for Large Language Models
Xin Ding, Xiaoyu Liu, Zhijun Tu, Yun-feng Zhang, Wei Li, ..., Hanting Chen, Yehui Tang, Zhiwei Xiong, Baoqun Yin, Yunhe Wang
MQ · 13 Dec 2023

Efficient Quantization Strategies for Latent Diffusion Models
Yuewei Yang, Xiaoliang Dai, Jialiang Wang, Peizhao Zhang, Hongbo Zhang
DiffM, MQ · 09 Dec 2023

EE-LLM: Large-Scale Training and Inference of Early-Exit Large Language Models with 3D Parallelism
Yanxi Chen, Xuchen Pan, Yaliang Li, Bolin Ding, Jingren Zhou
LRM · 08 Dec 2023

SmoothQuant+: Accurate and Efficient 4-bit Post-Training Weight Quantization for LLM
Jiayi Pan, Chengcan Wang, Kaifu Zheng, Yangguang Li, Zhenyu Wang, Bin Feng
MQ · 06 Dec 2023

Nonparametric Variational Regularisation of Pretrained Transformers
Fabio Fehr, James Henderson
01 Dec 2023

Deepfakes, Misinformation, and Disinformation in the Era of Frontier AI, Generative AI, and Large AI Models
Mohamed R. Shoaib, Ze Wang, Milad Taleby Ahvanooey, Jun Zhao
29 Nov 2023

Shedding the Bits: Pushing the Boundaries of Quantization with Minifloats on FPGAs
Shivam Aggarwal, Hans Jakob Damsgaard, Alessandro Pappalardo, Giuseppe Franco, Thomas B. Preußer, Michaela Blott, Tulika Mitra
MQ · 21 Nov 2023

LQ-LoRA: Low-rank Plus Quantized Matrix Decomposition for Efficient Language Model Finetuning
Han Guo, P. Greengard, Eric P. Xing, Yoon Kim
MQ · 20 Nov 2023

Zero redundancy distributed learning with differential privacy
Zhiqi Bu, Justin Chiu, Ruixuan Liu, Sheng Zha, George Karypis
20 Nov 2023

HexGen: Generative Inference of Large Language Model over Heterogeneous Environment
Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, Binhang Yuan
SyDa · 20 Nov 2023

A Speed Odyssey for Deployable Quantization of LLMs
Qingyuan Li, Ran Meng, Yiduo Li, Bo-Wen Zhang, Liang Li, Yifan Lu, Xiangxiang Chu, Yerui Sun, Yuchen Xie
MQ · 16 Nov 2023

Towards the Law of Capacity Gap in Distilling Language Models
Chen Zhang, Dawei Song, Zheyu Ye, Yan Gao
ELM · 13 Nov 2023

AI-native Interconnect Framework for Integration of Large Language Model Technologies in 6G Systems
Sasu Tarkoma, Roberto Morabito, Jaakko Sauvola
10 Nov 2023

S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, ..., Banghua Zhu, Lianmin Zheng, Kurt Keutzer, Joseph E. Gonzalez, Ion Stoica
MoE · 06 Nov 2023

AWEQ: Post-Training Quantization with Activation-Weight Equalization for Large Language Models
Baisong Li, Xingwang Wang, Haixiao Xu
MQ · 02 Nov 2023

Efficient LLM Inference on CPUs
Haihao Shen, Hanwen Chang, Bo Dong, Yu Luo, Hengyu Meng
MQ · 01 Nov 2023

SiDA-MoE: Sparsity-Inspired Data-Aware Serving for Efficient and Scalable Large Mixture-of-Experts Models
Zhixu Du, Shiyu Li, Yuhao Wu, Xiangyu Jiang, Jingwei Sun, Qilin Zheng, Yongkai Wu, Ang Li, Hai Helen Li, Yiran Chen
MoE · 29 Oct 2023

Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze, Arvind Krishnamurthy
28 Oct 2023

ZeroQuant-HERO: Hardware-Enhanced Robust Optimized Post-Training Quantization Framework for W8A8 Transformers
Zhewei Yao, Reza Yazdani Aminabadi, Stephen Youn, Xiaoxia Wu, Elton Zheng, Yuxiong He
MQ · 26 Oct 2023

FedPEAT: Convergence of Federated Learning, Parameter-Efficient Fine Tuning, and Emulator Assisted Tuning for Artificial Intelligence Foundation Models with Mobile Edge Computing
Terence Jie Chua, Wen-li Yu, Junfeng Zhao, Kwok-Yan Lam
FedML · 26 Oct 2023

Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, ..., Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, Beidi Chen
BDL · 26 Oct 2023

LLM Performance Predictors are good initializers for Architecture Search
Ganesh Jawahar, Muhammad Abdul-Mageed, L. Lakshmanan, Dujian Ding
LRM · 25 Oct 2023

Vision Language Models in Autonomous Driving: A Survey and Outlook
Xingcheng Zhou, Mingyu Liu, Ekim Yurtsever, B. L. Žagar, Walter Zimmer, Hu Cao, Alois C. Knoll
VLM · 22 Oct 2023

BitNet: Scaling 1-bit Transformers for Large Language Models
Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
MQ · 17 Oct 2023

TEQ: Trainable Equivalent Transformation for Quantization of LLMs
Wenhua Cheng, Yiyang Cai, Kaokao Lv, Haihao Shen
MQ · 17 Oct 2023

One-Shot Sensitivity-Aware Mixed Sparsity Pruning for Large Language Models
Hang Shao, Bei Liu, Bo Xiao, Ke Zeng, Guanglu Wan, Yanmin Qian
14 Oct 2023

QUIK: Towards End-to-End 4-Bit Inference on Generative Large Language Models
Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, Dan Alistarh
MQ, SyDa · 13 Oct 2023

Dynamic Sparse No Training: Training-Free Fine-tuning for Sparse LLMs
Yu-xin Zhang, Lirui Zhao, Mingbao Lin, Yunyun Sun, Yiwu Yao, Xingjia Han, Jared Tanner, Shiwei Liu, Rongrong Ji
SyDa · 13 Oct 2023

LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models
Yixiao Li, Yifan Yu, Chen Liang, Pengcheng He, Nikos Karampatziakis, Weizhu Chen, Tuo Zhao
MQ · 12 Oct 2023

QLLM: Accurate and Efficient Low-Bitwidth Quantization for Large Language Models
Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, Bohan Zhuang
MQ · 12 Oct 2023

CacheGen: KV Cache Compression and Streaming for Fast Language Model Serving
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, ..., Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, Junchen Jiang
11 Oct 2023

Sparse Fine-tuning for Inference Acceleration of Large Language Models
Eldar Kurtic, Denis Kuznedelev, Elias Frantar, Michael Goin, Dan Alistarh
10 Oct 2023

Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning
Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen
10 Oct 2023

LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models
Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, Lili Qiu
09 Oct 2023

Fast and Robust Early-Exiting Framework for Autoregressive Language Models with Synchronized Parallel Decoding
Sangmin Bae, Jongwoo Ko, Hwanjun Song, SeYoung Yun
09 Oct 2023

Outlier Weighed Layerwise Sparsity (OWL): A Missing Secret Sauce for Pruning LLMs to High Sparsity
Lu Yin, You Wu, Zhenyu (Allen) Zhang, Cheng-Yu Hsieh, Yaqing Wang, ..., Mykola Pechenizkiy, Yi Liang, Michael Bendersky, Zhangyang Wang, Shiwei Liu
08 Oct 2023

Revisiting Block-based Quantisation: What is Important for Sub-8-bit LLM Inference?
Cheng Zhang, Jianyi Cheng, Ilia Shumailov, G. Constantinides, Yiren Zhao
MQ · 08 Oct 2023

Compresso: Structured Pruning with Collaborative Prompting Learns Compact Large Language Models
Song Guo, Jiahang Xu, Li Lyna Zhang, Mao Yang
08 Oct 2023

Dual Grained Quantization: Efficient Fine-Grained Quantization for LLM
Luoming Zhang, Wen Fei, Weijia Wu, Yefei He, Zhenyu Lou, Hong Zhou
MQ · 07 Oct 2023

ReLU Strikes Back: Exploiting Activation Sparsity in Large Language Models
Iman Mirzadeh, Keivan Alizadeh-Vahid, Sachin Mehta, C. C. D. Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, Mehrdad Farajtabar
06 Oct 2023

How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation
Josh Alman, Zhao-quan Song
06 Oct 2023

ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
VLM · 04 Oct 2023

BTR: Binary Token Representations for Efficient Retrieval Augmented Language Models
Qingqing Cao, Sewon Min, Yizhong Wang, Hannaneh Hajishirzi
MQ, RALM · 02 Oct 2023

Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications
Duc N. M. Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang
KELM, CLL · 02 Oct 2023

Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, Mike Lewis
AI4TS, RALM · 29 Sep 2023

PB-LLM: Partially Binarized Large Language Models
Yuzhang Shang, Zhihang Yuan, Qiang Wu, Zhen Dong
MQ · 29 Sep 2023

Training and inference of large language models using 8-bit floating point
Sergio P. Perez, Yan Zhang, James Briggs, Charlie Blake, P. Krishnamurthy, Paul Balanca, Carlo Luschi, Stephen Barlow, Andrew William Fitzgibbon
MQ · 29 Sep 2023

ModuLoRA: Finetuning 2-Bit LLMs on Consumer GPUs by Integrating with Modular Quantizers
Junjie Yin, Jiahao Dong, Yingheng Wang, Christopher De Sa, Volodymyr Kuleshov
MQ · 28 Sep 2023

Rethinking Channel Dimensions to Isolate Outliers for Low-bit Weight Quantization of Large Language Models
Jung Hwan Heo, Jeonghoon Kim, Beomseok Kwon, Byeongwook Kim, Se Jung Kwon, Dongsoo Lee
MQ · 27 Sep 2023