FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

FSVideo Team
Qingyu Chen
Zhiyuan Fang
Haibin Huang
Xinwei Huang
Tong Jin
Minxuan Lin
Bo Liu
Celong Liu
Chongyang Ma
Xing Mei
Xiaohui Shen
Yaojie Shen
Fuwen Tan
Angtian Wang
Xiao Yang
Yiding Yang
Jiamin Yuan
Lingxi Zhang
Yuxin Zhang
Main: 16 Pages
10 Figures
Bibliography: 6 Pages
4 Tables
Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on three key components: (1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design and training strategies in this report.
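To make the stated downsampling ratio concrete, the following minimal sketch computes the latent grid size for a given clip. Only the factors 64 (height), 64 (width), and 4 (time) come from the abstract. The function names, the use of plain ceiling division (the actual autoencoder may pad or treat the first frame differently), and the example clip size are assumptions made for illustration.

```python
import math

# Downsampling factors stated in the abstract: 64x per spatial axis, 4x in time.
SPATIAL_FACTOR = 64
TEMPORAL_FACTOR = 4


def latent_grid(frames: int, height: int, width: int) -> tuple[int, int, int]:
    """Return the (frames, height, width) of the latent grid.

    Assumption: plain ceiling division on each axis. The real autoencoder
    may handle padding or the first frame differently.
    """
    return (
        math.ceil(frames / TEMPORAL_FACTOR),
        math.ceil(height / SPATIAL_FACTOR),
        math.ceil(width / SPATIAL_FACTOR),
    )


def position_reduction() -> int:
    """Number of input pixel positions mapped to one latent position."""
    return SPATIAL_FACTOR * SPATIAL_FACTOR * TEMPORAL_FACTOR


if __name__ == "__main__":
    # Hypothetical clip: 64 frames at 1024x1024 pixels.
    print(latent_grid(64, 1024, 1024))
    print(position_reduction())
```

Under these assumptions, each latent position covers 64 · 64 · 4 = 16384 pixel positions, which gives a sense of how short the sequence seen by the diffusion transformer becomes. The count ignores channel width, which the abstract does not state.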
