
EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

Main: 8 pages
7 figures
Bibliography: 2 pages
13 tables
Appendix: 6 pages
Abstract

Mixture-of-Experts (MoE) models enable scalable computation and strong performance in large-scale deep learning but pose quantization challenges due to sparse expert activation and dynamic routing. Existing post-training quantization (PTQ) methods fail to address activation outliers, routing instability, and sparse expert calibration, leading to significant performance degradation. To address this, we propose EAQuant, a PTQ framework tailored for MoE architectures. Our method introduces three expert-aware innovations: (1) smoothing aggregation to suppress activation outliers, (2) routing consistency alignment to preserve expert selection after quantization, and (3) calibration data balance to optimize sparsely activated experts. These strategies collectively enable robust, high-precision quantization of MoE models under ultra-low-bit configurations. Experiments across several extreme quantization settings (e.g., W4A4/W3A4/W3A3/W2A4) demonstrate that EAQuant significantly outperforms existing methods, achieving average accuracy improvements of 1.15–13.81% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at this https URL.
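To make the "smoothing aggregation" idea concrete, below is a minimal sketch (not the official EAQuant code) of how a single per-channel smoothing scale could be aggregated across the experts that share an input, in the spirit of SmoothQuant-style scaling. The function names (`aggregate_smoothing_scales`, `apply_smoothing`), the use of a per-channel max to aggregate over experts, and the `alpha` balancing parameter are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of expert-aware smoothing aggregation (assumed, not EAQuant's code).
# Assumption: each expert is a linear layer over the same input activations, and one shared
# per-channel smoothing scale is folded into every expert's weights, with its inverse folded
# into the activations so the full-precision output is unchanged.
import torch

def aggregate_smoothing_scales(act_absmax: torch.Tensor,
                               expert_weights: list[torch.Tensor],
                               alpha: float = 0.5,
                               eps: float = 1e-5) -> torch.Tensor:
    """Compute one per-channel smoothing scale shared by all experts.

    act_absmax:     [in_features] per-channel max |activation| from calibration data
    expert_weights: list of [out_features, in_features] expert weight matrices
    """
    # Per-input-channel weight range aggregated over all experts (max is one simple choice).
    w_absmax = torch.stack([w.abs().amax(dim=0) for w in expert_weights]).amax(dim=0)
    # SmoothQuant-style balance between activation and weight ranges.
    scale = act_absmax.clamp(min=eps).pow(alpha) / w_absmax.clamp(min=eps).pow(1 - alpha)
    return scale.clamp(min=eps)

@torch.no_grad()
def apply_smoothing(scale: torch.Tensor, expert_weights: list[torch.Tensor]) -> torch.Tensor:
    """Fold the shared scale into every expert's weights and return the inverse scale,
    which is applied to the shared input activations before the experts."""
    for w in expert_weights:
        w.mul_(scale)       # W' = W * diag(s), scales each input channel
    return 1.0 / scale      # activations are multiplied by 1/s, so (X/s)(W*s)^T = X W^T
```

Because (X · diag(1/s)) · (W · diag(s))ᵀ = X · Wᵀ, this rescaling is output-preserving in full precision while shifting outlier magnitude from activations into weights, where it is easier to quantize.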
