Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Conference on Machine Learning and Systems (MLSys), 2024
arXiv:2310.19102 · v3 (latest) · 29 October 2023
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Wenlei Bao, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
MQ
arXiv (abs) · PDF · HTML · HuggingFace (11 upvotes)
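
The title names the core technique: quantizing an LLM's weights and activations to low bit-widths (4-bit, with fine-grained group scaling) so that serving can run on fast low-precision hardware paths. As a rough illustration only, here is a minimal NumPy sketch of symmetric per-group 4-bit weight quantization. This is not Atom's actual implementation (which handles outliers in mixed precision and fuses quantization into custom serving kernels); the group size of 128 and the symmetric [-7, 7] code range are assumptions of this sketch.

    # Minimal sketch of symmetric per-group 4-bit quantization (illustrative
    # only; NOT Atom's actual kernels). Group size 128 is an assumption.
    import numpy as np

    def quantize_int4(w: np.ndarray, group_size: int = 128):
        """Quantize a [rows, cols] fp32 matrix to int4 codes, one scale per group."""
        rows, cols = w.shape
        assert cols % group_size == 0, "cols must be a multiple of group_size"
        groups = w.reshape(rows, cols // group_size, group_size)
        # Map each group's max magnitude onto the symmetric int4 range [-7, 7].
        scales = np.maximum(np.abs(groups).max(axis=-1, keepdims=True) / 7.0, 1e-8)
        codes = np.clip(np.round(groups / scales), -7, 7).astype(np.int8)
        return codes, scales

    def dequantize_int4(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
        """Recover an approximate fp32 matrix from int4 codes and group scales."""
        rows, n_groups, group_size = codes.shape
        return (codes.astype(np.float32) * scales).reshape(rows, n_groups * group_size)

    w = np.random.randn(64, 256).astype(np.float32)
    codes, scales = quantize_int4(w)
    w_hat = dequantize_int4(codes, scales)
    print("mean abs reconstruction error:", np.abs(w - w_hat).mean())

In a real serving stack the int4 codes would be packed two per byte and consumed directly by a low-bit GEMM kernel; the round trip above only shows the accuracy side of the trade-off.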

Papers citing "Atom: Low-bit Quantization for Efficient and Accurate LLM Serving"

Showing 50 of 129 citing papers.

Accelerating Large-Scale Reasoning Model Inference with Sparse Self-Speculative Decoding
Yilong Zhao, Jiaming Tang, Kan Zhu, Zihao Ye, Chi-chih Chang, ..., Mohamed S. Abdelfattah, Mingyu Gao, Baris Kasikci, Song Han, Ion Stoica
ReLM, LRM · 01 Dec 2025

SingleQuant: Efficient Quantization of Large Language Models in a Single Pass
Jinying Xiao, Bin Ji, Shasha Li, Xiaodong Liu, Ma Jun, Ye Zhong, Wei Li, Xuan Xie, Qingbo Wu, Jie Yu
MQ · 27 Nov 2025

Range Asymmetric Numeral Systems-Based Lightweight Intermediate Feature Compression for Split Computing of Deep Neural Networks
Mingyu Sung, Suhwan Im, Vikas Palakonda, Jae-Mo Kang
11 Nov 2025

P3-LLM: An Integrated NPU-PIM Accelerator for LLM Inference Using Hybrid Numerical Formats
Yuzong Chen, Chao Fang, Xilai Dai, Yuheng Wu, Thierry Tambe, Marian Verhelst, Mohamed S. Abdelfattah
10 Nov 2025

Memory- and Latency-Constrained Inference of Large Language Models via Adaptive Split Computing
Mingyu Sung, Vikas Palakonda, Suhwan Im, Sunghwan Moon, Il-Min Kim, Sangseok Yun, Jae-Mo Kang
MQ · 06 Nov 2025

DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization
Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng
MQ · 06 Nov 2025

KV Cache Transform Coding for Compact Storage in LLM Inference
Konrad Staniszewski, Adrian Łańcucki
VLM · 03 Nov 2025

FlashEVA: Accelerating LLM inference via Efficient Attention
Juan Gabriel Kostelec, Qinghai Guo
01 Nov 2025

Energy-Efficient and Dequantization-Free Q-LLMs: A Spiking Neural Network Approach to Salient Value Mitigation
Chenyu Wang, Zhanglu Yan, Zhi Zhou, Xu Chen, Weng-Fai Wong
MQ · 22 Oct 2025

DynaKV: Enabling Accurate and Efficient Long-Sequence LLM Decoding on Smartphones
Tuowei Wang, Minxing Huang, Fengzu Li, Ligeng Chen, Jinrui Zhang, Ju Ren
20 Oct 2025

Mixed-Precision Quantization for Language Models: Techniques and Prospects
M. Rakka, Marios Fournarakis, Olga Krestinskaya, Jinane Bazzi, K. Salama, Fadi J. Kurdahi, A. Eltawil, M. Fouda
MQ · 19 Oct 2025

FraQAT: Quantization Aware Training with Fractional bits
Luca Morreale, Alberto Gil C. P. Ramos, Malcolm Chadwick, Mehdi Noroozi, Ruchika Chavhan, Abhinav Mehrotra, S. Bhattacharya
MQ · 16 Oct 2025

Nondeterminism-Aware Optimistic Verification for Floating-Point Neural Networks
Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath
15 Oct 2025

SPAD: Specialized Prefill and Decode Hardware for Disaggregated LLM Inference
Hengrui Zhang, Pratyush Patel, August Ning, D. Wentzlaff
MoE · 09 Oct 2025

FlexiQ: Adaptive Mixed-Precision Quantization for Latency/Accuracy Trade-Offs in Deep Neural Networks
Jaemin Kim, Hongjun Um, Sungkyun Kim, Yongjun Park, Jiwon Seo
MQ · 03 Oct 2025

Layer-wise dynamic rank for compressing large language models
Zhendong Mi, Bian Sun, Grace Li Zhang, Shaoyi Huang
ALM · 30 Sep 2025

Rearchitecting Datacenter Lifecycle for AI: A TCO-Driven Framework
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Ricardo Bianchini
30 Sep 2025

Bridging the Gap Between Promise and Performance for Microscaling FP4 Quantization
Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, ..., Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, Dan Alistarh
MQ · 27 Sep 2025

LiquidGEMM: Hardware-Efficient W4A8 GEMM Kernel for High-Performance LLM Serving
Huanqi Hu, Bowen Xiao, Shixuan Sun, Jianian Yin, Zhexi Zhang, ..., Chengquan Jiang, Weiqi Xu, Xiaoying Jia, Xin Liu, Minyi Guo
MQ, VLM · 01 Sep 2025

Quantized but Deceptive? A Multi-Dimensional Truthfulness Evaluation of Quantized LLMs
Y. Fu, Xianxuan Long, Runchao Li, Haotian Yu, Mu Sheng, Xiaotian Han, Yu Yin, Pan Li
HILM · 26 Aug 2025

APT-LLM: Exploiting Arbitrary-Precision Tensor Core Computing for LLM Acceleration
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 2025
Shaobo Ma, Chao Fang, Haikuo Shao, Zhongfeng Wang
26 Aug 2025

MoEcho: Exploiting Side-Channel Attacks to Compromise User Privacy in Mixture-of-Experts LLMs
Ruyi Ding, Tianhong Xu, Xinyi Shen, A. A. Ding, Yunsi Fei
MoE, AAML · 20 Aug 2025

Quantization Meets dLLMs: A Systematic Study of Post-training Quantization for Diffusion LLMs
Haokun Lin, Haobo Xu, Yichen Wu, Ziyu Guo, Renrui Zhang, Zhichao Lu, Ying Wei, Gang Qu, Zhenan Sun
DiffM, MQ · 20 Aug 2025

FlexQ: Efficient Post-training INT6 Quantization for LLM Serving via Algorithm-System Co-Design
Hao Zhang, Aining Jia, Weifeng Bu, Y. Cai, Kai Sheng, Hao Chen, Xin He
MQ · 06 Aug 2025

KLLM: Fast LLM Inference with K-Means Quantization
Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Helen Li
MQ · 30 Jul 2025

A Comprehensive Evaluation on Quantization Techniques for Large Language Models
Yutong Liu, Cairong Zhao, Guosheng Hu
MQ · 23 Jul 2025

CompLeak: Deep Learning Model Compression Exacerbates Privacy Leakage
Na Li, Yansong Gao, Hongsheng Hu, Boyu Kuang, Anmin Fu
22 Jul 2025

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference
Emmanouil Benetos, Jiawei Zhang, Baotong Lu, Qianxi Zhang, Chengruidong Zhang, ..., Chen Chen, Mingxing Zhang, Yuqing Yang, Fan Yang, Mao Yang
01 Jul 2025

MoQAE: Mixed-Precision Quantization for Long-Context LLM Inference via Mixture of Quantization-Aware Experts
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wei Tao, Haocheng Lu, Xiaoyang Qu, Bin Zhang, Kai Lu, Jiguang Wan, Jianzong Wang
MQ, MoE · 09 Jun 2025

ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration
Xianglong Yan, Zhiteng Li, Tianao Zhang, Linghe Kong, Yulun Zhang, Yunbo Wang
30 May 2025

AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity
Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang
29 May 2025

Learning Interpretable Differentiable Logic Networks for Tabular Regression
C. Yue, N. Jha
29 May 2025

FireQ: Fast INT4-FP8 Kernel and RoPE-aware Quantization for LLM Inference Acceleration
Daehyeon Baek, Jieun Choi, Jimyoung Son, Kyungmin Bin, Seungbeom Choi, Kihyo Moon, Minsung Jang, Hyojung Lee
MQ · 27 May 2025

Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
26 May 2025

Efficient and Workload-Aware LLM Serving via Runtime Layer Swapping and KV Cache Resizing
Zhaoyuan Su, Tingfeng Lan, Zirui Wang, Juncheng Yang, Yue Cheng
24 May 2025

Model-Distributed Inference for Large Language Models at the Edge
IEEE Workshop on Local and Metropolitan Area Networks (LAN/MAN), 2025
Davide Macario, H. Seferoglu, Erdem Koyuncu
13 May 2025

Turning LLM Activations Quantization-Friendly
International Symposium on Applied Computational Intelligence and Informatics (SACI), 2025
Patrik Czakó, Gábor Kertész, Sándor Szénási
MQ · 11 May 2025

MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
MQ, MoE · 09 May 2025

Rethinking Memory in LLM based Agents: Representations, Operations, and Emerging Topics
Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sébastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
KELM, MU · 01 May 2025

Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics
Cong Xu, Wenbin Liang, Mo Yu, Anan Liu, Jianchao Tan, Lizhuang Ma, Jiangming Wang, Jun Wang, Weinan Zhang, Wei Zhang
MQ · 01 May 2025

Taming the Titans: A Survey of Efficient LLM Inference Serving
Ranran Zhen, Junlin Li, Yixin Ji, Zhiyong Yang, Tong Liu, Qingrong Xia, Xinyu Duan, Zehao Wang, Baoxing Huai, Hao Fei
LLMAG · 28 Apr 2025

FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference
Coleman Hooper, Charbel Sakr, Ben Keller, Rangharajan Venkatesan, Kurt Keutzer, Siyang Song, Brucek Khailany
MQ · 19 Apr 2025

HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar, Shashank Nag, Jason Clemons, L. John, Poulami Das
14 Apr 2025

MoE-Lens: Towards the Hardware Limit of High-Throughput MoE LLM Serving Under Resource Constraints
Yichao Yuan, Lin Ma, Nishil Talati
MoE · 12 Apr 2025

GPT Carry-On: Training Foundation Model for Customization Could Be Simple, Scalable and Affordable
Jianqiao Wangni
10 Apr 2025

Achieving binary weight and activation for LLMs using Post-Training Quantization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Siqing Song, Chuang Wang, Ruiqi Wang, Yi Yang, Xuyao Zhang
MQ · 07 Apr 2025

SQuat: Subspace-orthogonal KV Cache Quantization
Hao Wang, Ligong Han, Kai Xu, Akash Srivastava
MQ · 31 Mar 2025

Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
Design, Automation and Test in Europe (DATE), 2025
Wei Tao, Bin Zhang, Xiaoyang Qu, Jiguang Wan, Jianzong Wang
30 Mar 2025

Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Hung-Yueh Chiang, Chi-chih Chang, N. Frumkin, Kai-Chiang Wu, Mohamed S. Abdelfattah, Diana Marculescu
MQ · 28 Mar 2025

Exploring the Roles of Large Language Models in Reshaping Transportation Systems: A Survey, Framework, and Roadmap
Tong Nie, Jian Sun, Wei Ma
27 Mar 2025