DualDub: Video-to-Soundtrack Generation via Joint Speech and Background Audio Synthesis

Wenjie Tian
Xinfa Zhu
Haohe Liu
Zhixian Zhao
Zihao Chen
Chaofan Ding
Xinhan Di
Junjie Zheng
Lei Xie
Main: 8 pages, 4 figures, 5 tables; Bibliography: 4 pages
Abstract

While recent video-to-audio (V2A) models can generate realistic background audio from visual input, they largely overlook speech, an essential part of many video soundtracks. This paper proposes a new task, video-to-soundtrack (V2ST) generation, which aims to jointly produce synchronized background audio and speech within a unified framework. To tackle V2ST, we introduce DualDub, a unified framework built on a multimodal language model that integrates a multimodal encoder, a cross-modal aligner, and dual decoding heads for simultaneous background audio and speech generation. Specifically, the cross-modal aligner employs causal and non-causal attention mechanisms to improve synchronization and acoustic harmony. In addition, to mitigate data scarcity, we design a curriculum learning strategy that progressively builds the model's multimodal capability. Finally, we introduce DualBench, the first benchmark for V2ST evaluation, with a carefully curated test set and comprehensive metrics. Experimental results demonstrate that DualDub achieves state-of-the-art performance, generating high-quality and well-synchronized soundtracks with both speech and background audio.
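To make the described architecture concrete, the sketch below shows one plausible reading of the abstract's design: a cross-modal aligner that combines causal self-attention over the audio token stream with non-causal cross-attention to visual features, feeding two decoding heads (background audio and speech) on a shared backbone. The paper does not publish code, so all class names, dimensions, and the exact attention wiring here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class CrossModalAligner(nn.Module):
    """Hypothetical aligner mixing causal and non-causal attention.

    Causal self-attention keeps soundtrack generation autoregressive;
    non-causal cross-attention lets every audio position see the full
    visual context, which is one way to read "causal and non-causal
    attention mechanisms" in the abstract.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.causal_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, audio_tokens: torch.Tensor, visual_feats: torch.Tensor) -> torch.Tensor:
        T = audio_tokens.size(1)
        # Upper-triangular -inf mask enforces causality within the audio stream.
        causal_mask = torch.triu(
            torch.full((T, T), float("-inf"), device=audio_tokens.device), diagonal=1
        )
        h, _ = self.causal_attn(audio_tokens, audio_tokens, audio_tokens, attn_mask=causal_mask)
        h = self.norm1(audio_tokens + h)
        # Non-causal: audio positions attend to all visual frames.
        c, _ = self.cross_attn(h, visual_feats, visual_feats)
        return self.norm2(h + c)


class DualDubSketch(nn.Module):
    """Shared backbone with dual decoding heads (assumed codec-token outputs)."""

    def __init__(self, d_model: int = 512, audio_vocab: int = 1024, speech_vocab: int = 1024):
        super().__init__()
        self.aligner = CrossModalAligner(d_model)
        self.audio_head = nn.Linear(d_model, audio_vocab)    # background-audio tokens
        self.speech_head = nn.Linear(d_model, speech_vocab)  # speech tokens

    def forward(self, audio_tokens: torch.Tensor, visual_feats: torch.Tensor):
        h = self.aligner(audio_tokens, visual_feats)
        return self.audio_head(h), self.speech_head(h)


if __name__ == "__main__":
    model = DualDubSketch()
    aud = torch.randn(2, 50, 512)  # 50 audio token embeddings per clip
    vis = torch.randn(2, 30, 512)  # 30 visual frame features per clip
    bg_logits, sp_logits = model(aud, vis)  # each: (2, 50, vocab)
    print(bg_logits.shape, sp_logits.shape)
```

Predicting both streams from one aligned representation, rather than running separate V2A and dubbing models, is what would let the two heads stay synchronized and acoustically coherent; the actual DualDub head design and token vocabularies may differ.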
