QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving

7 May 2024
Yujun Lin
Haotian Tang
Shang Yang
Zhekai Zhang
Guangxuan Xiao
Chuang Gan
Song Han

Papers citing "QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving"

50 / 52 papers shown
LightNobel: Improving Sequence Length Limitation in Protein Structure Prediction Model via Adaptive Activation Quantization
Seunghee Han
S. Choi
J. Kim
14
0
0
09 May 2025
MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
Haojie Duanmu
Xiuhong Li
Zhihang Yuan
Size Zheng
Jiangfei Duan
Xingcheng Zhang
Dahua Lin
MQ
MoE
55
0
0
09 May 2025
Hexcute: A Tile-based Programming Language with Automatic Layout and Task-Mapping Synthesis
X. Zhang
Yaoyao Ding
Yang Hu
Gennady Pekhimenko
36
0
0
22 Apr 2025
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Jiliang Ni
Jiachen Pu
Zhongyi Yang
Kun Zhou
Hui Wang
Xiaoliang Xiao
Dakui Wang
Xin Li
Jingfeng Luo
Conggang Hu
29
0
0
18 Apr 2025
AccLLM: Accelerating Long-Context LLM Inference Via Algorithm-Hardware Co-Design
Yanbiao Liang
Huihong Shi
Haikuo Shao
Zhongfeng Wang
10
0
0
07 Apr 2025
Cocktail: Chunk-Adaptive Mixed-Precision Quantization for Long-Context LLM Inference
Wei Tao
Bin Zhang
Xiaoyang Qu
Jiguang Wan
Jianzong Wang
29
1
0
30 Mar 2025
Quamba2: A Robust and Scalable Post-training Quantization Framework for Selective State Space Models
Hung-Yueh Chiang
Chi-chih Chang
N. Frumkin
Kai-Chiang Wu
Mohamed S. Abdelfattah
Diana Marculescu
MQ
46
0
0
28 Mar 2025
QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition
Yuxuan Hu
Xiaodong Chen
C. Li
H. Chen
J. Zhang
MQ
58
0
0
25 Mar 2025
LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation
Han Chen
Zicong Jiang
Zining Zhang
Bingsheng He
Pingyi Luo
M. Lu
Yuqiang Chen
MQ
40
0
0
25 Mar 2025
BitDecoding: Unlocking Tensor Cores for Long-Context LLMs Decoding with Low-Bit KV Cache
Dayou Du
Shijie Cao
Jianyi Cheng
Ting Cao
M. Yang
MQ
58
0
0
24 Mar 2025
XAttention: Block Sparse Attention with Antidiagonal Scoring
Ruyi Xu
Guangxuan Xiao
Haofeng Huang
Junxian Guo
Song Han
64
3
0
20 Mar 2025
ML-SpecQD: Multi-Level Speculative Decoding with Quantized Drafts
E. Georganas
Dhiraj D. Kalamkar
Alexander Kozlov
A. Heinecke
MQ
39
0
0
17 Mar 2025
PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
Yangyijian Liu
Jun Yu Li
Wu-Jun Li
24
0
0
15 Mar 2025
Accurate INT8 Training Through Dynamic Block-Level Fallback
Pengle Zhang
Jia Wei
Jintao Zhang
Jun-Jie Zhu
Jianfei Chen
MQ
68
3
0
13 Mar 2025
OuroMamba: A Data-Free Quantization Framework for Vision Mamba Models
Akshat Ramachandran
Mingyu Lee
Huan Xu
Souvik Kundu
Tushar Krishna
MQ
49
1
0
13 Mar 2025
Alchemist: Towards the Design of Efficient Online Continual Learning System
Yuyang Huang
Yuhan Liu
Haryadi S. Gunawi
Beibin Li
Changho Hwang
CLL
OnRL
98
0
0
03 Mar 2025
Identifying Sensitive Weights via Post-quantization Integral
Yuezhou Hu
Weiyu Huang
Zichen Liang
C. L. P. Chen
Jintao Zhang
J. Zhu
Jianfei Chen
MQ
37
2
0
28 Feb 2025
Binary Neural Networks for Large Language Model: A Survey
Liangdong Liu
Zhitong Zheng
Cong Wang
Tianhuang Su
Z. Yang
MQ
58
0
0
26 Feb 2025
SVDq: 1.25-bit and 410x Key Cache Compression for LLM Attention
Yankun Hong
Xing Li
Hui-Ling Zhen
Xianzhi Yu
Wulong Liu
Mingxuan Yuan
MQ
74
0
0
24 Feb 2025
Optimizing Large Language Model Training Using FP4 Quantization
Ruizhe Wang
Yeyun Gong
Xiao Liu
Guoshuai Zhao
Ziyue Yang
Baining Guo
Zhengjun Zha
Peng Cheng
MQ
61
4
0
28 Jan 2025
PrefixQuant: Eliminating Outliers by Prefixed Tokens for Large Language Models Quantization
Mengzhao Chen
Yi Liu
Jiahao Wang
Yi Bin
Wenqi Shao
Ping Luo
MQ
58
2
0
28 Jan 2025
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Han Guo
William Brandon
Radostin Cholakov
Jonathan Ragan-Kelley
Eric P. Xing
Yoon Kim
MQ
64
12
0
20 Jan 2025
Rethinking Post-Training Quantization: Introducing a Statistical Pre-Calibration Approach
Alireza Ghaffari
Sharareh Younesian
Boxing Chen
Vahid Partovi Nia
M. Asgharian
MQ
51
0
0
17 Jan 2025
Unifying KV Cache Compression for Large Language Models with LeanKV
Yanqi Zhang
Yuwei Hu
Runyuan Zhao
John C. S. Lui
Haibo Chen
MQ
89
5
0
04 Dec 2024
MixPE: Quantization and Hardware Co-design for Efficient LLM Inference
Yu Zhang
M. Wang
Lancheng Zou
Wulong Liu
Hui-Ling Zhen
M. Yuan
Bei Yu
MQ
69
1
0
25 Nov 2024
Anda: Unlocking Efficient LLM Inference with a Variable-Length Grouped Activation Data Format
Chao Fang
Man Shi
Robin Geens
Arne Symons
Zhongfeng Wang
Marian Verhelst
64
0
0
24 Nov 2024
FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers
Zehua Pei
Hui-Ling Zhen
Xianzhi Yu
Sinno Jialin Pan
M. Yuan
Bei Yu
AI4CE
79
0
0
21 Nov 2024
Closer Look at Efficient Inference Methods: A Survey of Speculative Decoding
Hyun Ryu
Eric Kim
72
3
0
20 Nov 2024
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization
Jintao Zhang
Haofeng Huang
Pengle Zhang
Jia Wei
Jun-Jie Zhu
Jianfei Chen
VLM
MQ
50
2
0
17 Nov 2024
Context Parallelism for Scalable Million-Token Inference
Amy Yang
Jingyi Yang
Aya Ibrahim
Xinfeng Xie
Bangsheng Tang
Grigory Sizov
Jeremy Reizenstein
Jongsoo Park
Jianyu Huang
MoE
LRM
50
5
0
04 Nov 2024
Watermarking Large Language Models and the Generated Content: Opportunities and Challenges
Ruisi Zhang
F. Koushanfar
WaLM
36
0
0
24 Oct 2024
QSpec: Speculative Decoding with Complementary Quantization Schemes
Juntao Zhao
Wenhao Lu
Sheng Wang
Lingpeng Kong
Chuan Wu
MQ
51
5
0
15 Oct 2024
TemporalBench: Benchmarking Fine-grained Temporal Understanding for Multimodal Video Models
Mu Cai
Reuben Tan
Jianrui Zhang
Bocheng Zou
Kai Zhang
...
Yao Dou
J. Park
Jianfeng Gao
Yong Jae Lee
Jianwei Yang
34
12
0
14 Oct 2024
FlatQuant: Flatness Matters for LLM Quantization
Yuxuan Sun
Ruikang Liu
Haoli Bai
Han Bao
Kang Zhao
...
Lu Hou
Chun Yuan
Xin Jiang
W. Liu
Jun Yao
MQ
46
3
0
12 Oct 2024
SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration
Jintao Zhang
Jia Wei
Pengle Zhang
Jun-Jie Zhu
Jun Zhu
Jianfei Chen
VLM
MQ
69
18
0
03 Oct 2024
Rotated Runtime Smooth: Training-Free Activation Smoother for accurate INT4 inference
Ke Yi
Zengke Liu
Jianwei Zhang
Chengyuan Li
Tong Zhang
Junyang Lin
Jingren Zhou
MQ
38
0
0
30 Sep 2024
Efficient Arbitrary Precision Acceleration for Large Language Models on GPU Tensor Cores
Shaobo Ma
Chao Fang
Haikuo Shao
Zhongfeng Wang
23
3
0
26 Sep 2024
Inference-Friendly Models With MixAttention
Shashank Rajput
Ying Sheng
Sean Owen
Vitaliy Chiley
74
1
0
23 Sep 2024
CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios
Luning Wang
Shiyao Li
Xuefei Ning
Zhihang Yuan
Shengen Yan
Guohao Dai
Yu Wang
33
0
0
16 Sep 2024
NanoFlow: Towards Optimal Large Language Model Serving Throughput
Kan Zhu
Yilong Zhao
Liangyu Zhao
Gefei Zuo
Yile Gu
...
Keisuke Kamahori
Chien-Yu Lin
Stephanie Wang
Arvind Krishnamurthy
Baris Kasikci
23
26
0
22 Aug 2024
ABQ-LLM: Arbitrary-Bit Quantized Inference Acceleration for Large Language Models
Chao Zeng
Songwei Liu
Yusheng Xie
Hong Liu
Xiaojian Wang
Miao Wei
Shu Yang
Fangmin Chen
Xing Mei
MQ
27
5
0
16 Aug 2024
LRQ: Optimizing Post-Training Quantization for Large Language Models by Learning Low-Rank Weight-Scaling Matrices
Jung Hyun Lee
Jeonghoon Kim
J. Yang
S. Kwon
Eunho Yang
Kang Min Yoo
Dongsoo Lee
MQ
25
2
0
16 Jul 2024
EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
Mengzhao Chen
Wenqi Shao
Peng Xu
Jiahao Wang
Peng Gao
Kaipeng Zhang
Yu Qiao
Ping Luo
MQ
34
21
0
10 Jul 2024
FoldGPT: Simple and Effective Large Language Model Compression Scheme
Songwei Liu
Chao Zeng
Lianqiang Li
Chenqian Yan
Lean Fu
Xing Mei
Fangmin Chen
26
4
0
01 Jul 2024
VcLLM: Video Codecs are Secretly Tensor Codecs
Ceyu Xu
Yongji Wu
Xinyu Yang
Beidi Chen
Matthew Lentz
Danyang Zhuo
Lisa Wu Wills
42
0
0
29 Jun 2024
From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
Sean Welleck
Amanda Bertsch
Matthew Finlayson
Hailey Schoelkopf
Alex Xie
Graham Neubig
Ilia Kulikov
Zaid Harchaoui
33
45
0
24 Jun 2024
QQQ: Quality Quattuor-Bit Quantization for Large Language Models
Ying Zhang
Peng Zhang
Mincong Huang
Jingyang Xiang
Yujie Wang
Chao Wang
Yineng Zhang
Lei Yu
Chuan Liu
Wei Lin
VLM
MQ
31
3
0
14 Jun 2024
Low-Rank Quantization-Aware Training for LLMs
Yelysei Bondarenko
Riccardo Del Chiaro
Markus Nagel
MQ
25
8
0
10 Jun 2024
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation
Tianchen Zhao
Tongcheng Fang
Haofeng Huang
Enshu Liu
Widyadewi Soedarmadji
...
Shengen Yan
Huazhong Yang
Xuefei Ning
Yu Wang
MQ
VGen
94
21
0
04 Jun 2024
Integer Scale: A Free Lunch for Faster Fine-grained Quantization of LLMs
Qingyuan Li
Ran Meng
Yiduo Li
Bo Zhang
Yifan Lu
Yerui Sun
Lin Ma
Yuchen Xie
MQ
33
0
0
23 May 2024