
Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

Main:11 Pages
11 Figures
Bibliography:2 Pages
4 Tables
Abstract

Mixture-of-Experts (MoE) has become a practical architecture for scaling LLM capacity while keeping per-token compute modest, but deploying MoE models on a single, memory-limited GPU remains difficult because expert weights dominate the HBM footprint. Existing expert offloading and prefetching systems reduce the resident set, yet they often pay expert-loading costs on the critical path when activation becomes dense. Post-training quantization (PTQ) lowers the footprint without transfers, but prevailing pipelines fix expert bit-widths offline and assume routing remains stable, even though MoE expert utilization is heavy-tailed and the hot set can shift across workloads. We present DynaExq, a runtime-aware mixed-precision serving system that treats single-GPU MoE inference under a hard HBM envelope as an online, budget-constrained precision allocation problem. The key insight is to keep the experts that dominate runtime traffic resident at higher precision, while maintaining a low-precision fallback for the remaining experts, so the system can reduce transfer volume and avoid the waiting latency that limits offloading and prefetching under dense activation. DynaExq estimates long-horizon expert hotness from router traces, selects a per-layer high-precision resident set via a budget-feasible top-n rule, and applies promotions and demotions asynchronously through stable expert handles so the forward pass always executes on a fully materialized expert version. Across Qwen3-MoE-30B/80B and six benchmarks, DynaExq improves accuracy over static PTQ on Qwen3-80B (73.09% to 77.57%) under comparable device-memory budgets and achieves up to 2.73x higher throughput than offloading/prefetch baselines at batch size 32.
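
To make the hotness-estimation and budget-feasible top-n steps concrete, the following is a minimal sketch for a single layer. It is not DynaExq's implementation: the abstract does not specify the estimator or the budget accounting, so the exponential moving average, the function names (`update_hotness`, `select_high_precision`), the `decay` parameter, and the per-expert byte sizes are all assumptions made for illustration. The sketch assumes each expert is resident at exactly one precision, so promoting an expert costs the difference between its high- and low-precision sizes.

```python
from collections import Counter


def update_hotness(hotness, routed_experts, decay=0.9):
    """Fold one batch of router decisions into a per-expert moving average.

    hotness: list of floats, one per expert in this layer (updated in place).
    routed_experts: expert ids chosen by the router for the batch.
    decay: assumed smoothing factor; larger values give a longer horizon.
    """
    counts = Counter(routed_experts)
    for expert in range(len(hotness)):
        hotness[expert] = decay * hotness[expert] + (1.0 - decay) * counts.get(expert, 0)
    return hotness


def select_high_precision(hotness, hi_bytes, lo_bytes, budget_bytes):
    """Return the ids of the hottest experts that fit at high precision.

    Assumed budget model: n * hi_bytes + (num_experts - n) * lo_bytes <= budget_bytes.
    """
    num_experts = len(hotness)
    if hi_bytes <= lo_bytes:
        raise ValueError("high precision must cost more than low precision")
    base = num_experts * lo_bytes
    if base > budget_bytes:
        raise ValueError("budget cannot hold all experts even at low precision")
    # Largest n whose promotion overhead still fits in the remaining budget.
    n = min(num_experts, (budget_bytes - base) // (hi_bytes - lo_bytes))
    # Rank by hotness, breaking ties by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-hotness[e], e))
    return set(ranked[:n])


# Hypothetical layer with 4 experts and toy sizes in arbitrary units.
layer_hotness = update_hotness([0.0] * 4, [0, 0, 2], decay=0.5)
resident = select_high_precision(layer_hotness, hi_bytes=4, lo_bytes=1, budget_bytes=10)
print(sorted(resident))  # prints [0, 2]
```

In this toy run the budget leaves room for two promotions, so the two most frequently routed experts are selected. A serving system would additionally need to apply the resulting promotions and demotions off the critical path, which this sketch does not attempt.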
