FSVideo: Fast Speed Video Diffusion Model in a Highly-Compressed Latent Space

2 February 2026

FSVideo Team

Qingyu Chen

Zhiyuan Fang

Haibin Huang

Xinwei Huang

Tong Jin

Minxuan Lin

Bo Liu

Celong Liu

Chongyang Ma

Xing Mei

Xiaohui Shen

Yaojie Shen

Fuwen Tan

Angtian Wang

Xiao Yang

Yiding Yang

Jiamin Yuan

Lingxi Zhang

Yuxin Zhang

VGen


Main: 16 Pages

10 Figures

Bibliography: 6 Pages

4 Tables

Abstract

We introduce FSVideo, a fast transformer-based image-to-video (I2V) diffusion framework. The framework is built on three key components: (1) a new video autoencoder with a highly compressed latent space ($64\times64\times4$ spatial-temporal downsampling ratio) that achieves competitive reconstruction quality; (2) a diffusion transformer (DIT) architecture with a new layer memory design that enhances inter-layer information flow and context reuse within the DIT; and (3) a multi-resolution generation strategy that uses a few-step DIT upsampler to increase video fidelity. Our final model, which contains a 14B DIT base model and a 14B DIT upsampler, achieves competitive performance against other popular open-source models while being an order of magnitude faster. We discuss our model design as well as training strategies in this report.
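To make the $64\times64\times4$ downsampling ratio concrete, the following minimal sketch computes the latent grid size for a video clip. It is an illustration only and not the authors' code: the clip size (64 frames at 1024×1024), the use of plain ceiling division, and the $8\times8\times4$ comparison ratio are all assumptions chosen for the arithmetic, and the helper names `latent_grid` and `num_positions` are hypothetical.

```python
import math


def latent_grid(frames, height, width, t_ratio=4, s_ratio=64):
    """Latent grid (T, H, W) for a clip, assuming plain ceiling division.

    Assumption: real autoencoders may treat the first frame or padding
    differently; this only illustrates the ratio stated in the abstract.
    """
    return (
        math.ceil(frames / t_ratio),
        math.ceil(height / s_ratio),
        math.ceil(width / s_ratio),
    )


def num_positions(grid):
    """Number of spatio-temporal latent positions in a (T, H, W) grid."""
    t, h, w = grid
    return t * h * w


# Hypothetical clip: 64 frames at 1024x1024 pixels.
fs_grid = latent_grid(64, 1024, 1024, t_ratio=4, s_ratio=64)

# Comparison ratio of 8x8 spatial, 4x temporal (an assumed baseline).
base_grid = latent_grid(64, 1024, 1024, t_ratio=4, s_ratio=8)

fs_positions = num_positions(fs_grid)
base_positions = num_positions(base_grid)

print("64x64x4 grid:", fs_grid, "positions:", fs_positions)
print("8x8x4 grid:  ", base_grid, "positions:", base_positions)
print("reduction factor:", base_positions // fs_positions)
```

Under these assumptions the $64\times64\times4$ grid has 64 times fewer latent positions than the $8\times8\times4$ grid. How many transformer tokens this corresponds to also depends on the patch size, which the abstract does not state.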
