
MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team
Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Shiqi Jiang, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu
22 pages (main), 10 figures, 8 tables; bibliography: 6 pages; appendix: 10 pages
Abstract

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 demonstrate the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference, and supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
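The 32B-total / 18B-active split follows from top-k MoE routing: each token is sent to only a few experts, so most expert weights are idle on any given step. Below is a minimal, generic sketch of that mechanism in PyTorch; all dimensions, the module structure, and the class name `TopKMoE` are toy assumptions for illustration, not MOVA's actual design.

```python
# Toy top-k Mixture-of-Experts layer: illustrates why the per-token "active"
# parameter count is much smaller than the total. Not MOVA's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)      # keep only the top-k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
total = sum(p.numel() for p in moe.parameters())
# Per token, only top_k of n_experts expert MLPs run (plus the shared router):
active = sum(p.numel() for p in moe.router.parameters()) + \
         moe.top_k * sum(p.numel() for p in moe.experts[0].parameters())
print(f"total params: {total}, active per token: {active}")
```

At scale, the same arithmetic gives the abstract's numbers: the shared (non-expert) parameters plus the routed subset of experts account for the 18B active out of 32B total.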
