DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis

Wenjie Tian
Xinfa Zhu
Haohe Liu
Zhixian Zhao
Zihao Chen
Chaofan Ding
Xinhan Di
Junjie Zheng
Lei Xie
Main: 8 pages, 4 figures, 5 tables; Bibliography: 4 pages
Abstract

While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, the cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. In addition, to mitigate data scarcity, we design a curriculum learning strategy that progressively builds the model's multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation, with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
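To make the described architecture concrete, the sketch below shows one plausible reading of the abstract's design: a cross-modal aligner that combines causal self-attention over the audio token stream with non-causal cross-attention to visual features, feeding two decoding heads (background audio and speech) on a shared backbone. The paper does not publish code, so all class names, dimensions, and the exact attention wiring here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossModalAligner(nn.Module):
    """Hypothetical aligner mixing causal and non-causal attention.

    Causal self-attention keeps soundtrack generation autoregressive;
    non-causal cross-attention lets every audio position see the full
    visual context, which is one way to read "causal and non-causal
    attention mechanisms" in the abstract.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.causal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, audio_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        T = audio_tokens.size(1)
        # Upper-triangular -inf mask enforces causality within the audio stream.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=audio_tokens.device), diagonal=1
        )
        h, _ = self.causal_attn(audio_tokens, audio_tokens, audio_tokens, attn_mask=causal_mask)
        h = self.norm1(audio_tokens + h)
        # Non-causal: audio positions attend to all visual frames.
        c, _ = self.cross_attn(h, visual_feats, visual_feats)
        return self.norm2(h + c)


class DualDubSketch(nn.Module):
    """Shared backbone with dual decoding heads (assumed codec-token outputs)."""

    def __init__(self, d_model: int = 512, audio_vocab: int = 1024, speech_vocab: int = 1024):
        super().__init__()
        self.aligner = CrossModalAligner(d_model)
        self.audio_head = nn.Linear(d_model, audio_vocab)    # background-audio tokens
        self.speech_head = nn.Linear(d_model, speech_vocab)  # speech tokens

    def forward(self, audio_tokens: torch.Tensor, visual_feats: torch.Tensor):
        h = self.aligner(audio_tokens, visual_feats)
        return self.audio_head(h), self.speech_head(h)


if __name__ == "__main__":
    model = DualDubSketch()
    aud = torch.randn(2, 50, 512)  # 50 audio token embeddings per clip
    vis = torch.randn(2, 30, 512)  # 30 visual frame features per clip
    bg_logits, sp_logits = model(aud, vis)  # each: (2, 50, vocab)
    print(bg_logits.shape, sp_logits.shape)
```

Predicting both streams from one aligned representation, rather than running separate V2A and dubbing models, is what would let the two heads stay synchronized and acoustically coherent; the actual DualDub head design and token vocabularies may differ.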
