LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models

Recent advances in Latent Video Diffusion Models (LVDMs) have revolutionized video generation by leveraging Video Variational Autoencoders (Video VAEs) to compress intricate video data into a compact latent space. However, as LVDM training scales, the computational overhead of Video VAEs becomes a critical bottleneck, particularly for encoding high-resolution videos. To address this, we propose LeanVAE, a novel and ultra-efficient Video VAE framework that introduces two key innovations: (1) a lightweight architecture based on a Neighborhood-Aware Feedforward (NAF) module and non-overlapping patch operations, drastically reducing computational cost, and (2) the integration of wavelet transforms and compressed sensing techniques to enhance reconstruction quality. Extensive experiments validate LeanVAE's superiority in video reconstruction and generation, particularly in enhancing efficiency over existing Video VAEs. Our model offers up to 50x fewer FLOPs and 44x faster inference speed while maintaining competitive reconstruction quality, providing insights for scalable, efficient Video VAEs. Our models and code are available at this https URL.
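The abstract does not spell out the NAF module or the patch sizes, so the following PyTorch sketch is only an illustrative reading of innovation (1): non-overlapping spatiotemporal patchification via a strided 3D convolution, followed by a hypothetical neighborhood-aware feedforward block (depthwise 3D convolution to mix neighboring patch tokens, then a pointwise MLP). All class names, channel widths, and patch sizes are assumptions, not the authors' implementation.

```python
# Hypothetical sketch: non-overlapping video patchify + a NAF-style block.
# Exact LeanVAE details are not given in the abstract; shapes are illustrative.
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Non-overlapping (t, h, w) patchification via a strided 3D convolution."""
    def __init__(self, in_ch=3, dim=128, patch=(4, 8, 8)):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):           # x: (B, C, T, H, W)
        return self.proj(x)         # (B, dim, T/4, H/8, W/8)

class NeighborhoodAwareFFN(nn.Module):
    """Assumed NAF-style block: a depthwise 3D conv aggregates neighboring
    patch tokens, then a pointwise MLP mixes channels (cheap vs. attention)."""
    def __init__(self, dim=128, expansion=4):
        super().__init__()
        self.norm = nn.GroupNorm(1, dim)
        self.dw = nn.Conv3d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.mlp = nn.Sequential(
            nn.Conv3d(dim, dim * expansion, 1),
            nn.GELU(),
            nn.Conv3d(dim * expansion, dim, 1),
        )

    def forward(self, x):
        return x + self.mlp(self.dw(self.norm(x)))

if __name__ == "__main__":
    video = torch.randn(1, 3, 16, 256, 256)     # toy 16-frame clip
    tokens = PatchEmbed3D()(video)              # (1, 128, 4, 32, 32)
    out = NeighborhoodAwareFFN()(tokens)
    print(out.shape)
```

Because the patches never overlap and token mixing is limited to a local neighborhood, the per-frame cost stays linear in the number of patches, which is consistent with the efficiency gains the abstract reports, though the real architecture may differ.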
@article{cheng2025_2503.14325,
  title   = {LeanVAE: An Ultra-Efficient Reconstruction VAE for Video Diffusion Models},
  author  = {Yu Cheng and Fajie Yuan},
  journal = {arXiv preprint arXiv:2503.14325},
  year    = {2025}
}