
MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team
Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu Peng, Yang Gao, Yanru Huo, Ying Zhu, Yinze Luo, Yiyang Zhang, Yuerong Song, Zhe Xu, Zhiyu Zhang, Chenchen Yang, Cheng Chang, Chushu Zhou, Hanfu Chen, Hongnan Ma, Jiaxi Li, Jingqi Tong, Junxi Liu, Ke Chen, Shimin Li, Shiqi Jiang, Songlin Wang, Wei Jiang, Zhaoye Fei, Zhiyuan Ning, Chunguo Li, Chenhui Li, Ziwei He, Zengfeng Huang, Xie Chen, Xipeng Qiu
22 pages (main), 10 figures, 8 tables; bibliography: 6 pages; appendix: 10 pages
Abstract

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 demonstrate the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference, and supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
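The 32B-total / 18B-active split follows from top-k MoE routing: each token is sent to only a few experts, so most expert weights are idle on any given step. Below is a minimal, generic sketch of that mechanism in PyTorch; all dimensions, the module structure, and the class name `TopKMoE` are toy assumptions for illustration, not MOVA's actual design.

```python
# Toy top-k Mixture-of-Experts layer: illustrates why the per-token "active"
# parameter count is much smaller than the total. Not MOVA's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)  # scores each expert per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)          # (tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)      # keep only the top-k experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize gate weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                      # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * self.experts[e](x[mask])
        return out

moe = TopKMoE()
total = sum(p.numel() for p in moe.parameters())
# Per token, only top_k of n_experts expert MLPs run (plus the shared router):
active = sum(p.numel() for p in moe.router.parameters()) + \
         moe.top_k * sum(p.numel() for p in moe.experts[0].parameters())
print(f"total params: {total}, active per token: {active}")
```

At scale, the same arithmetic gives the abstract's numbers: the shared (non-expert) parameters plus the routed subset of experts account for the 18B active out of 32B total.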
