Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model (arXiv 1906.00532)

3 June 2019
Aishwarya Bhandare
Vamsi Sripathi
Deepthi Karkada
Vivek V. Menon
Sun Choi
Kushal Datta
V. Saletore
    MQ

Papers citing "Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model"

50 of 71 citing papers shown
ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration
Mengting Ai
Tianxin Wei
Yifan Chen
Zhichen Zeng
Ritchie Zhao
G. Varatkar
B. Rouhani
Xianfeng Tang
Hanghang Tong
Jingrui He
MoE
47
1
0
10 Mar 2025
Optimizing Large Language Models through Quantization: A Comparative Analysis of PTQ and QAT Techniques
Jahid Hasan
MQ
25
1
0
09 Nov 2024
Token Pruning using a Lightweight Background Aware Vision Transformer
Sudhakar Sah
Ravish Kumar
Honnesh Rohmetra
Ehsan Saboori
ViT
23
1
0
12 Oct 2024
RevMUX: Data Multiplexing with Reversible Adapters for Efficient LLM Batch Inference
Yige Xu
Xu Guo
Zhiwei Zeng
Chunyan Miao
34
0
0
06 Oct 2024
Optimization of DNN-based speaker verification model through efficient quantization technique
Yeona Hong
Woo-Jin Chung
Hong-Goo Kang
MQ
26
1
0
12 Jul 2024
Outlier-Efficient Hopfield Layers for Large Transformer-Based Models
Jerry Yao-Chieh Hu
Pei-Hsuan Chang
Haozheng Luo
Hong-Yu Chen
Weijian Li
Wei-Po Wang
Han Liu
36
25
0
04 Apr 2024
Efficiently Distilling LLMs for Edge Applications
Achintya Kundu
Fabian Lim
Aaron Chew
L. Wynter
Penny Chong
Rhui Dih Lee
42
6
0
01 Apr 2024
A Comprehensive Survey of Compression Algorithms for Language Models
Seungcheol Park
Jaehyeon Choi
Sojin Lee
U. Kang
MQ
26
12
0
27 Jan 2024
Cascade Speculative Drafting for Even Faster LLM Inference
Ziyi Chen
Xiaocong Yang
Jiacheng Lin
Chenkai Sun
Kevin Chen-Chuan Chang
Jie Huang
LRM
19
47
0
18 Dec 2023
FP8-BERT: Post-Training Quantization for Transformer
Jianwei Li
Tianchi Zhang
Ian En-Hsu Yen
Dongkuan Xu
MQ
15
5
0
10 Dec 2023
Interpretability Illusions in the Generalization of Simplified Models
Dan Friedman
Andrew Kyle Lampinen
Lucas Dixon
Danqi Chen
Asma Ghandeharioun
17
14
0
06 Dec 2023
Sample-based Dynamic Hierarchical Transformer with Layer and Head Flexibility via Contextual Bandit
Fanfei Meng
Lele Zhang
Yu Chen
Yuxin Wang
8
10
0
05 Dec 2023
Efficient Post-training Quantization with FP8 Formats
Haihao Shen
Naveen Mellempudi
Xin He
Q. Gao
Chang‐Bao Wang
Mengni Wang
MQ
23
19
0
26 Sep 2023
Tango: rethinking quantization for graph neural network training on GPUs
Shiyang Chen
Da Zheng
Caiwen Ding
Chengying Huan
Yuede Ji
Hang Liu
GNN
MQ
23
5
0
02 Aug 2023
ITA: An Energy-Efficient Attention and Softmax Accelerator for Quantized Transformers
Gamze Islamoglu
Moritz Scherer
G. Paulin
Tim Fischer
Victor J. B. Jung
Angelo Garofalo
Luca Benini
MQ
22
11
0
07 Jul 2023
An Efficient Sparse Inference Software Accelerator for Transformer-based Language Models on CPUs
Haihao Shen
Hengyu Meng
Bo Dong
Zhe Wang
Ofir Zafrir
...
Hanwen Chang
Qun Gao
Zi. Wang
Guy Boudoukh
Moshe Wasserblat
MoE
31
4
0
28 Jun 2023
Revisiting Token Pruning for Object Detection and Instance Segmentation
Yifei Liu
Mathias Gehrig
Nico Messikommer
Marco Cannici
Davide Scaramuzza
ViT
VLM
37
24
0
12 Jun 2023
Transformer-based models and hardware acceleration analysis in autonomous driving: A survey
J. Zhong
Zheng Liu
Xiangshan Chen
ViT
44
17
0
21 Apr 2023
SwiftTron: An Efficient Hardware Accelerator for Quantized Transformers
Alberto Marchisio
David Durà
Maurizio Capra
Maurizio Martina
Guido Masera
Muhammad Shafique
33
18
0
08 Apr 2023
Block-wise Bit-Compression of Transformer-based Models
Gaochen Dong
W. Chen
16
0
0
16 Mar 2023
Dynamic Stashing Quantization for Efficient Transformer Training
Guofu Yang
Daniel Lo
Robert D. Mullins
Yiren Zhao
MQ
29
8
0
09 Mar 2023
Teacher Intervention: Improving Convergence of Quantization Aware Training for Ultra-Low Precision Transformers
Minsoo Kim
Kyuhong Shim
Seongmin Park
Wonyong Sung
Jungwook Choi
MQ
11
1
0
23 Feb 2023
Binarized Neural Machine Translation
Yichi Zhang
Ankush Garg
Yuan Cao
Lukasz Lew
Behrooz Ghorbani
Zhiru Zhang
Orhan Firat
MQ
34
14
0
09 Feb 2023
Understanding and Improving Knowledge Distillation for Quantization-Aware Training of Large Transformer Encoders
Minsoo Kim
Sihwa Lee
S. Hong
Duhyeuk Chang
Jungwook Choi
MQ
16
12
0
20 Nov 2022
Zero-Shot Dynamic Quantization for Transformer Inference
Yousef El-Kurdi
Jerry Quinn
Avirup Sil
MQ
14
1
0
17 Nov 2022
Fast DistilBERT on CPUs
Haihao Shen
Ofir Zafrir
Bo Dong
Hengyu Meng
Xinyu. Ye
Zhe Wang
Yi Ding
Hanwen Chang
Guy Boudoukh
Moshe Wasserblat
VLM
21
2
0
27 Oct 2022
Legal-Tech Open Diaries: Lesson learned on how to develop and deploy light-weight models in the era of humongous Language Models
Stelios Maroudas
Sotiris Legkas
Prodromos Malakasiotis
Ilias Chalkidis
VLM
AILaw
ALM
ELM
29
4
0
24 Oct 2022
AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models
S. Kwon
Jeonghoon Kim
Jeongin Bae
Kang Min Yoo
Jin-Hwa Kim
Baeseong Park
Byeongwook Kim
Jung-Woo Ha
Nako Sung
Dongsoo Lee
MQ
23
30
0
08 Oct 2022
Towards Fine-tuning Pre-trained Language Models with Integer Forward and Backward Propagation
Mohammadreza Tayaranian
Alireza Ghaffari
Marzieh S. Tahaei
Mehdi Rezagholizadeh
M. Asgharian
V. Nia
MQ
29
6
0
20 Sep 2022
Efficient Quantized Sparse Matrix Operations on Tensor Cores
Shigang Li
Kazuki Osawa
Torsten Hoefler
74
31
0
14 Sep 2022
Efficient Methods for Natural Language Processing: A Survey
Marcos Vinícius Treviso
Ji-Ung Lee
Tianchu Ji
Betty van Aken
Qingqing Cao
...
Emma Strubell
Niranjan Balasubramanian
Leon Derczynski
Iryna Gurevych
Roy Schwartz
28
109
0
31 Aug 2022
ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization
Cong Guo
Chen Zhang
Jingwen Leng
Zihan Liu
Fan Yang
Yun-Bo Liu
Minyi Guo
Yuhao Zhu
MQ
16
55
0
30 Aug 2022
Sub-8-Bit Quantization Aware Training for 8-Bit Neural Network Accelerator with On-Device Speech Recognition
Kai Zhen
Hieu Duy Nguyen
Ravi Chinta
Nathan Susanj
Athanasios Mouchtaris
Tariq Afzal
Ariya Rastrow
MQ
20
11
0
30 Jun 2022
LUT-GEMM: Quantized Matrix Multiplication based on LUTs for Efficient Inference in Large-Scale Generative Language Models
Gunho Park
Baeseong Park
Minsub Kim
Sungjae Lee
Jeonghoon Kim
Beomseok Kwon
S. Kwon
Byeongwook Kim
Youngjoo Lee
Dongsoo Lee
MQ
13
73
0
20 Jun 2022
Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models
Clara Na
Sanket Vaibhav Mehta
Emma Strubell
62
19
0
25 May 2022
Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection
Xin Huang
A. Khetan
Rene Bidart
Zohar S. Karnin
17
14
0
27 Mar 2022
The Ecological Footprint of Neural Machine Translation Systems
D. Shterionov
Eva Vanmassenhove
32
3
0
04 Feb 2022
NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference
Joonsang Yu
Junki Park
Seongmin Park
Minsoo Kim
Sihwa Lee
Dong Hyun Lee
Jungwook Choi
35
48
0
03 Dec 2021
Efficient Softmax Approximation for Deep Neural Networks with Attention Mechanism
Ihor Vasyltsov
Wooseok Chang
25
12
0
21 Nov 2021
A Survey on Green Deep Learning
Jingjing Xu
Wangchunshu Zhou
Zhiyi Fu
Hao Zhou
Lei Li
VLM
73
83
0
08 Nov 2021
Understanding and Overcoming the Challenges of Efficient Transformer Quantization
Yelysei Bondarenko
Markus Nagel
Tijmen Blankevoort
MQ
12
133
0
27 Sep 2021
The NiuTrans System for WNGT 2020 Efficiency Task
Chi Hu
Bei Li
Ye Lin
Yinqiao Li
Yanyang Li
Chenglong Wang
Tong Xiao
Jingbo Zhu
20
7
0
16 Sep 2021
The NiuTrans System for the WMT21 Efficiency Task
Chenglong Wang
Chi Hu
Yongyu Mu
Zhongxiang Yan
Siming Wu
...
Hang Cao
Bei Li
Ye Lin
Tong Xiao
Jingbo Zhu
22
2
0
16 Sep 2021
Learned Token Pruning for Transformers
Sehoon Kim
Sheng Shen
D. Thorsley
A. Gholami
Woosuk Kwon
Joseph Hassoun
Kurt Keutzer
9
145
0
02 Jul 2021
Improving the Efficiency of Transformers for Resource-Constrained Devices
Hamid Tabani
Ajay Balasubramaniam
Shabbir Marzban
Elahe Arani
Bahram Zonooz
33
20
0
30 Jun 2021
On the Distribution, Sparsity, and Inference-time Quantization of Attention Values in Transformers
Tianchu Ji
Shraddhan Jain
M. Ferdman
Peter Milder
H. A. Schwartz
Niranjan Balasubramanian
MQ
42
15
0
02 Jun 2021
LEAP: Learnable Pruning for Transformer-based Models
Z. Yao
Xiaoxia Wu
Linjian Ma
Sheng Shen
Kurt Keutzer
Michael W. Mahoney
Yuxiong He
20
7
0
30 May 2021
Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale
Zhaoxia Deng
Jongsoo Park
P. T. P. Tang
Haixin Liu
...
S. Nadathur
Changkyu Kim
Maxim Naumov
S. Naghshineh
M. Smelyanskiy
15
11
0
26 May 2021
LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference
Ben Graham
Alaaeldin El-Nouby
Hugo Touvron
Pierre Stock
Armand Joulin
Hervé Jégou
Matthijs Douze
ViT
13
768
0
02 Apr 2021
n-hot: Efficient bit-level sparsity for powers-of-two neural network quantization
Yuiko Sakuma
Hiroshi Sumihiro
Jun Nishikawa
Toshiki Nakamura
Ryoji Ikegaya
MQ
35
1
0
22 Mar 2021