Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases

International Conference on Machine Learning (ICML), 2023
27 January 2023
Xiaoxia Wu, Cheng Li, Reza Yazdani Aminabadi, Zhewei Yao, Yuxiong He
Topics: MQ
arXiv (abs) · PDF · HTML

Papers citing "Understanding INT4 Quantization for Transformer Models: Latency Speedup, Composability, and Failure Cases"

16 / 16 papers shown
LegalEval-Q: A New Benchmark for The Quality Evaluation of LLM-Generated Legal Text
Li Yunhan, Wu Gengshen
Topics: AILaw, ELM, ALM
30 May 2025
Lightweight Embeddings with Graph Rewiring for Collaborative Filtering
Xurong Liang, Tong Chen, Wei Yuan, Hongzhi Yin
25 May 2025
HOT: Hadamard-based Optimized Training
Computer Vision and Pattern Recognition (CVPR), 2025
Seonggon Kim, Juncheol Shin, Seung-taek Woo, Eunhyeok Park
27 Mar 2025
MergeQuant: Accurate 4-bit Static Quantization of Large Language Models by Channel-wise Calibration
Jinguang Wang, Jiangming Wang, Haifeng Sun, Tingting Yang, Zirui Zhuang, Wanyi Ning, Yuexi Yin, Qi Qi, Jianxin Liao
Topics: MQ, MoMe
07 Mar 2025
Optimizing Large Language Model Training Using FP4 Quantization
Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, Peng Cheng
Topics: MQ
28 Jan 2025
iServe: An Intent-based Serving System for LLMs
Dimitrios Liakopoulos, Tianrui Hu, Prasoon Sinha, N. Yadwadkar
Topics: VLM
08 Jan 2025
BATON: Enhancing Batch-wise Inference Efficiency for Large Language Models via Dynamic Re-batching
The Web Conference (WWW), 2024
Peizhuang Cong, Qizhi Chen, Haochen Zhao, Tong Yang
Topics: KELM
24 Oct 2024
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs
Tianyu Guo, Druv Pai, Yu Bai, Jiantao Jiao, Michael I. Jordan, Song Mei
17 Oct 2024
AERO: Entropy-Guided Framework for Private LLM Inference
N. Jha, Brandon Reagen
16 Oct 2024
OutlierTune: Efficient Channel-Wise Quantization for Large Language Models
Jinguang Wang, Yuexi Yin, Haifeng Sun, Qi Qi, Jingyu Wang, Zirui Zhuang, Tingting Yang, Jianxin Liao
27 Jun 2024
A Semantic-Aware Layer-Freezing Approach to Computation-Efficient Fine-Tuning of Language Models
Jian Gu, A. Aleti, Chunyang Chen, Hongyu Zhang
17 Jun 2024
How to Parameterize Asymmetric Quantization Ranges for Quantization-Aware Training
Jaeseong You, Minseop Park, Kyunggeun Lee, Seokjun An, Chirag I. Patel, Markus Nagel
Topics: MQ
25 Apr 2024
Mitigating the Impact of Outlier Channels for Language Model Quantization with Activation Regularization
Aniruddha Nrusimha, Mayank Mishra, Naigang Wang, Dan Alistarh, Yikang Shen, Yoon Kim
Topics: MQ
04 Apr 2024
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models
Qiong Wu, Yiyi Zhou, Weihao Ye, Xiaoshuai Sun, Rongrong Ji
Topics: MoE
22 Mar 2024
A2Q+: Improving Accumulator-Aware Weight Quantization
Ian Colbert, Alessandro Pappalardo, Jakoba Petri-Koenig, Yaman Umuroglu
Topics: MQ
19 Jan 2024
Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
Neural Information Processing Systems (NeurIPS), 2023
Minsoo Kim, Sihwa Lee, Jangwhan Lee, S. Hong, Duhyeuk Chang, Wonyong Sung, Jungwook Choi
Topics: MQ
13 Aug 2023