
ALIVE: Animate Your World with Lifelike Audio-Video Generation

Ying Guo, Qijun Gan, Yifu Zhang, Jinlai Liu, Yifei Hu, Pan Xie, Dongjun Qian, Yu Zhang, Ruiqi Li, Yuqi Zhang, Ruibiao Lu, Xiaofeng Mei, Bo Han, Xiang Yin, Bingyue Peng, Zehuan Yuan
26 pages main text, 2 pages bibliography, 17 figures, 2 tables
Abstract

Video generation is rapidly evolving toward unified audio-video generation. In this paper, we present ALIVE, a generation model that adapts a pretrained Text-to-Video (T2V) model to Sora-style audio-video generation and animation. In particular, the model unlocks Text-to-Video&Audio (T2VA) and Reference-to-Video&Audio (animation) capabilities beyond those of the T2V foundation model. To support audio-visual synchronization and reference animation, we augment the popular MMDiT architecture with a joint audio-video branch that includes TA-CrossAttn for temporally-aligned cross-modal fusion and UniTemp-RoPE for precise audio-visual alignment. Meanwhile, a comprehensive data pipeline, covering audio-video captioning, quality control, and related stages, is carefully designed to collect high-quality finetuning data. Additionally, we introduce a new benchmark for comprehensive model evaluation and comparison. After continual pretraining and finetuning on millions of high-quality samples, ALIVE demonstrates outstanding performance, consistently outperforming open-source models and matching or surpassing state-of-the-art commercial solutions. With detailed recipes and benchmarks, we hope ALIVE helps the community develop audio-video generation models more efficiently. Official page: this https URL.
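The abstract names two components, TA-CrossAttn and UniTemp-RoPE, without detailing them. The PyTorch sketch below shows one plausible reading under stated assumptions: rotary phases driven by a timeline shared across modalities, and cross-attention masked to a temporal window. The function names, tensor shapes, and the windowed alignment rule are hypothetical illustrations, not the paper's actual design.

    # Hypothetical sketch of the two mechanisms named in the abstract.
    # Names, shapes, and the alignment rule are assumptions, not the
    # paper's implementation.
    import torch
    import torch.nn.functional as F

    def unitemp_rope(x: torch.Tensor, t: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
        """Rotary embedding on a shared timeline: tokens of either modality
        stamped with the same time t (seconds) get the same rotary phase,
        one plausible reading of a 'unified temporal RoPE'.
        x: (B, N, D) tokens; t: (B, N) timestamps in seconds."""
        d = x.shape[-1]
        freqs = theta ** (-torch.arange(0, d, 2, dtype=x.dtype) / d)  # (D/2,)
        ang = t[..., None] * freqs                                    # (B, N, D/2)
        cos, sin = ang.cos(), ang.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    def ta_cross_attn(video: torch.Tensor, audio: torch.Tensor,
                      t_video: torch.Tensor, t_audio: torch.Tensor,
                      window: float = 0.5) -> torch.Tensor:
        """Temporally-aligned cross-attention: each video query attends only
        to audio keys within +/- `window` seconds of its own timestamp
        (assumes every video frame has at least one audio token in range)."""
        q = unitemp_rope(video, t_video)  # rotate queries on the shared clock
        k = unitemp_rope(audio, t_audio)  # rotate keys on the same clock
        mask = (t_video[:, :, None] - t_audio[:, None, :]).abs() <= window  # (B, Nv, Na)
        return F.scaled_dot_product_attention(q, k, audio, attn_mask=mask)

    # Toy usage: 16 video latents and 64 audio latents covering the same 2 s clip.
    B, Nv, Na, D = 2, 16, 64, 128
    video, audio = torch.randn(B, Nv, D), torch.randn(B, Na, D)
    t_v = torch.linspace(0, 2.0, Nv).expand(B, Nv)
    t_a = torch.linspace(0, 2.0, Na).expand(B, Na)
    fused = ta_cross_attn(video, audio, t_v, t_a)  # (B, Nv, D)

Rotating only queries and keys (never values) follows standard RoPE practice; tying both modalities to one clock is what makes tokens at the same timestamp agree in phase, which is the property the names TA-CrossAttn and UniTemp-RoPE appear to promise.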
