
ALIVE: Animate Your World with Lifelike Audio-Video Generation

Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan
26 pages main text, 2 pages bibliography, 17 figures, 2 tables
Abstract

Video generation is rapidly evolving toward unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities beyond those of the T2V foundation model. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch that includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline, covering audio-video captioning, quality control, and related stages, is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark for comprehensive model evaluation and comparison. After continual pretraining and finetuning on millions of high-quality samples, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: this https URL.
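The abstract names two components, TA-CrossAttn and UniTemp-RoPE, without detailing them. The PyTorch sketch below shows one plausible reading under stated assumptions: rotary phases driven by a timeline shared across modalities, and cross-attention masked to a temporal window. The function names, tensor shapes, and the windowed alignment rule are hypothetical illustrations, not the paper's actual design.

    # Hypothetical sketch of the two mechanisms named in the abstract.
    # Names, shapes, and the alignment rule are assumptions, not the
    # paper's implementation.
    import torch
    import torch.nn.functional as F

    def unitemp_rope(x: torch.Tensor, t: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
        """Rotary embedding on a shared timeline: tokens of either modality
        stamped with the same time t (seconds) get the same rotary phase,
        one plausible reading of a 'unified temporal RoPE'.
        x: (B, N, D) tokens; t: (B, N) timestamps in seconds."""
        d = x.shape[-1]
        freqs = theta ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (D/2,)
        ang = t[..., None] * freqs                                    # (B, N, D/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    def ta_cross_attn(video: torch.Tensor, audio: torch.Tensor,
                      t_video: torch.Tensor, t_audio: torch.Tensor,
                      window: float = 0.5) -> torch.Tensor:
        """Temporally-aligned cross-attention: each video query attends only
        to audio keys within +/- `window` seconds of its own timestamp
        (assumes every video frame has at least one audio token in range)."""
        q = unitemp_rope(video, t_video)  # rotate queries on the shared clock
        k = unitemp_rope(audio, t_audio)  # rotate keys on the same clock
        mask = (t_video[:, :, None] - t_audio[:, None, :]).abs() <= window  # (B, Nv, Na)
        return F.scaled_dot_product_attention(q, k, audio, attn_mask=mask)

    # Toy usage: 16 video latents and 64 audio latents covering the same 2 s clip.
    B, Nv, Na, D = 2, 16, 64, 128
    video, audio = torch.randn(B, Nv, D), torch.randn(B, Na, D)
    t_v = torch.linspace(0, 2.0, Nv).expand(B, Nv)
    t_a = torch.linspace(0, 2.0, Na).expand(B, Na)
    fused = ta_cross_attn(video, audio, t_v, t_a)  # (B, Nv, D)

Rotating only queries and keys (never values) follows standard RoPE practice; tying both modalities to one clock is what makes tokens at the same timestamp agree in phase, which is the property the names TA-CrossAttn and UniTemp-RoPE appear to promise.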
