QVGen: Pushing the Limit of Quantized Video Generative Models
Video diffusion models (DMs) have enabled high-quality video synthesis. Yet their substantial computational and memory demands pose serious challenges to real-world deployment, even on high-end GPUs. Quantization, a commonly adopted solution, has shown notable success in reducing cost for image DMs, but its direct application to video DMs remains ineffective. In this paper, we present QVGen, a novel quantization-aware training (QAT) framework tailored for high-performance and inference-efficient video DMs under extremely low-bit quantization (e.g., 4-bit or below). We begin with a theoretical analysis demonstrating that reducing the gradient norm is essential to facilitate convergence for QAT. To this end, we introduce auxiliary modules to mitigate large quantization errors, leading to significantly enhanced convergence. To eliminate the inference overhead of these modules, we propose a rank-decay strategy that progressively eliminates them. Specifically, we repeatedly employ singular value decomposition (SVD) and a proposed rank-based regularization to identify and decay low-contributing components. This strategy retains performance while zeroing out additional inference overhead. Extensive experiments across state-of-the-art (SOTA) video DMs, spanning a range of parameter sizes, show that QVGen is the first to reach quality comparable to full precision under 4-bit settings. Moreover, it significantly outperforms existing methods. For instance, our 3-bit CogVideoX-2B achieves improvements in Dynamic Degree and Scene Consistency on VBench. Code and models are available at this https URL.
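To make the idea of a low-rank auxiliary term and its rank decay concrete, the following is a minimal NumPy sketch. It is not the QVGen implementation: the symmetric per-tensor quantizer, the helper names `fake_quantize` and `low_rank_part`, the matrix size, and the rank schedule are all assumptions chosen for illustration, and no training or regularization takes place. It only shows how truncated SVD can rank the components of a quantization-error matrix and how the residual changes as the retained rank shrinks to zero.

```python
import numpy as np

def fake_quantize(w, bits=4):
    """Symmetric uniform per-tensor quantize-dequantize (assumed, generic scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def low_rank_part(m, rank):
    """Best rank-`rank` approximation of m via truncated SVD."""
    if rank == 0:
        return np.zeros_like(m)
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    # Singular values come back sorted in descending order, so the first
    # `rank` components are the highest-contributing ones.
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64))   # stand-in for a full-precision weight matrix
w_q = fake_quantize(w, bits=4)      # 4-bit quantized-dequantized weights
err = w - w_q                       # error an auxiliary term could absorb

# Hypothetical decay schedule: shrink the auxiliary rank until it vanishes.
ranks = (32, 16, 8, 4, 0)
residuals = []
for rank in ranks:
    aux = low_rank_part(err, rank)
    residuals.append(np.linalg.norm(w - (w_q + aux)))

for rank, res in zip(ranks, residuals):
    print(f"rank={rank:2d}  residual Frobenius norm={res:.4f}")
```

In this static sketch the residual grows back as the rank is reduced, because nothing adapts to the shrinking auxiliary term. The abstract describes the decay as happening progressively during QAT, combined with a rank-based regularization, which is what allows performance to be retained once the auxiliary modules are gone.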