LD-LAudio-V1: Video-to-Long-Form-Audio Generation Extension with Dual Lightweight Adapters

Main: 4 Pages
3 Figures
Bibliography: 1 Page
2 Tables
Abstract

Generating high-quality, temporally synchronized audio from video content is essential for video editing and post-production, enabling the creation of semantically aligned audio for silent videos. However, most existing approaches either focus on short-form audio generation for video segments under 10 seconds or rely on noisy datasets for long-form video-to-audio synthesis. To address these limitations, we introduce LD-LAudio-V1, an extension of state-of-the-art video-to-audio models that incorporates dual lightweight adapters to enable long-form audio generation. In addition, we release a clean, human-annotated video-to-audio dataset that contains pure sound effects without noise or artifacts. Our method significantly reduces splicing artifacts and temporal inconsistencies while maintaining computational efficiency. Compared to direct fine-tuning with short training videos, LD-LAudio-V1 achieves significant improvements across multiple metrics: $FD_{\text{passt}}$ 450.00 $\rightarrow$ 327.29 (+27.27%), $FD_{\text{panns}}$ 34.88 $\rightarrow$ 22.68 (+34.98%), $FD_{\text{vgg}}$ 3.75 $\rightarrow$ 1.28 (+65.87%), $KL_{\text{panns}}$ 2.49 $\rightarrow$ 2.07 (+16.87%), $KL_{\text{passt}}$ 1.78 $\rightarrow$ 1.53 (+14.04%), $IS_{\text{panns}}$ 4.17 $\rightarrow$ 4.30 (+3.12%), $IB_{\text{score}}$ 0.25 $\rightarrow$ 0.28 (+12.00%), $Energy\Delta10\text{ms}$ 0.3013 $\rightarrow$ 0.1349 (+55.23%), $Energy\Delta10\text{ms(this http URL)}$ 0.0531 $\rightarrow$ 0.0288 (+45.76%), and $Sem.\,Rel.$ 2.73 $\rightarrow$ 3.28 (+20.15%). Our dataset aims to facilitate further research in long-form video-to-audio generation and is available at this https URL.
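
The percentages in parentheses appear to be relative changes with respect to the fine-tuning baseline, signed so that a positive value always means "better". The sketch below reproduces them under that assumption. The helper name `relative_improvement` and the choice of which metrics are lower-is-better (FD, KL, Energy) versus higher-is-better (IS, IB score, Sem. Rel.) are assumptions inferred from the reported numbers, not definitions stated in the abstract.

```python
def relative_improvement(baseline, ours, lower_is_better=True):
    """Relative change in percent, signed so that positive means an improvement.

    Assumed convention (not stated in the abstract):
      lower-is-better  -> (baseline - ours) / baseline
      higher-is-better -> (ours - baseline) / baseline
    """
    change = (baseline - ours) if lower_is_better else (ours - baseline)
    return 100.0 * change / baseline


# (baseline, LD-LAudio-V1, lower_is_better); values copied from the abstract.
# The metric direction flags are assumptions.
reported = {
    "FD_passt": (450.00, 327.29, True),
    "FD_panns": (34.88, 22.68, True),
    "FD_vgg": (3.75, 1.28, True),
    "KL_panns": (2.49, 2.07, True),
    "KL_passt": (1.78, 1.53, True),
    "IS_panns": (4.17, 4.30, False),
    "IB_score": (0.25, 0.28, False),
    "Energy_delta_10ms": (0.3013, 0.1349, True),
    "Sem_Rel": (2.73, 3.28, False),
}

for name, (baseline, ours, lower) in reported.items():
    pct = relative_improvement(baseline, ours, lower)
    print(f"{name}: {baseline} -> {ours} ({pct:+.2f}%)")
```

Reading the numbers this way makes clear that, for example, the +27.27% on $FD_{\text{passt}}$ is a reduction in distance, whereas the +3.12% on $IS_{\text{panns}}$ is an increase in score.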