Video-Guided Text-to-Music Generation Using Public Domain Movie Collections

Despite recent advancements in music generation systems, their application in film production remains limited, as they struggle to capture the nuances of real-world filmmaking, where filmmakers consider multiple factors, such as visual content, dialogue, and emotional tone, when selecting or composing music for a scene. This limitation stems primarily from the absence of comprehensive datasets that integrate these elements. To address this gap, we introduce the Open Screen Soundtrack Library (OSSL), a dataset consisting of movie clips from public domain films, totaling approximately 36.5 hours, paired with high-quality soundtracks and human-annotated mood information. To demonstrate the effectiveness of our dataset in improving the performance of pre-trained models on film music generation tasks, we introduce a new video adapter that enhances an autoregressive transformer-based text-to-music model by adding video-based conditioning. Our experimental results show that the proposed approach effectively enhances MusicGen-Medium in terms of both objective measures of distributional and paired fidelity and subjective compatibility in mood and genre. To facilitate reproducibility and foster future work, we publicly release the dataset, code, and demo.
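
The abstract describes the video adapter only at a high level. As a rough illustration of the idea, the minimal sketch below shows one common way such an adapter could add video-based conditioning to an autoregressive text-to-music model like MusicGen: project frame-level video features into the model's conditioning space, contextualize them temporally, and concatenate them with the text encoder's hidden states so the music decoder can attend to both. All module names, dimensions, and the choice of frame-feature extractor are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class VideoAdapter(nn.Module):
    """Hypothetical adapter: maps per-frame video features into the
    conditioning space of a text-to-music transformer. Dimensions are
    assumptions (e.g., 512-d frame features, 1536-d conditioning)."""

    def __init__(self, video_dim=512, cond_dim=1536, num_layers=2, num_heads=8):
        super().__init__()
        # Project video features up to the music model's conditioning width.
        self.proj = nn.Linear(video_dim, cond_dim)
        # A small transformer to model temporal structure across frames.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=cond_dim, nhead=num_heads, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, video_feats):
        # video_feats: (batch, num_frames, video_dim)
        x = self.proj(video_feats)   # (batch, num_frames, cond_dim)
        return self.temporal(x)      # temporally contextualized video tokens

# Example of joint conditioning (shapes and sources are assumed):
# text_cond could come from the model's frozen text encoder, and
# video_feats from a frozen visual encoder applied per frame.
text_cond = torch.randn(2, 32, 1536)    # e.g., text encoder hidden states
video_feats = torch.randn(2, 60, 512)   # e.g., 60 frames of visual features
adapter = VideoAdapter()
joint_cond = torch.cat([adapter(video_feats), text_cond], dim=1)
print(joint_cond.shape)  # torch.Size([2, 92, 1536])

Concatenating adapter tokens with the text conditioning is just one plausible design; prefix conditioning or dedicated cross-attention layers would be alternatives, and the paper should be consulted for the actual mechanism.
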
@article{kim2025_2506.12573,
  title={Video-Guided Text-to-Music Generation Using Public Domain Movie Collections},
  author={Haven Kim and Zachary Novack and Weihan Xu and Julian McAuley and Hao-Wen Dong},
  journal={arXiv preprint arXiv:2506.12573},
  year={2025}
}