Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation

3 July 2025
Feizhen Huang
Yu Wu
Yutian Lin
Bo Du
arXiv (abs) · PDF · HTML
Main: 7 pages · Bibliography: 2 pages · 6 figures · 4 tables
Abstract

Video-to-Audio (V2A) generation has achieved significant progress and plays a crucial role in film and video post-production. However, current methods overlook cinematic language, a critical component of artistic expression in filmmaking. As a result, their performance deteriorates in scenarios where Foley targets are only partially visible. To address this challenge, we propose a simple self-distillation approach to extend V2A models to cinematic-language scenarios. By simulating cinematic-language variations, the student model learns to align the video features of training pairs that share the same audio-visual correspondence, enabling it to effectively capture the associations between sounds and partial visual information. Our method not only achieves impressive improvements under partial visibility across all evaluation metrics, but also enhances performance on the large-scale V2A dataset, VGGSound.
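
As a rough sketch of the self-distillation idea described in the abstract (not the authors' code: the model names, the occlusion scheme, and the cosine alignment loss below are all illustrative assumptions), a teacher encoder sees the fully visible video while a student encoder sees a simulated partial view of the same clip and is trained to match the teacher's features for the same audio-visual pair:

# Illustrative self-distillation sketch for partial-visibility V2A.
# The masking scheme, encoders, and loss are assumptions, not the paper's code.
import torch
import torch.nn.functional as F

def simulate_partial_visibility(frames: torch.Tensor, keep: float = 0.5) -> torch.Tensor:
    """Crudely mimic cinematic framing by zeroing a random horizontal band,
    so the Foley target is only partially visible. frames: (B, C, T, H, W)."""
    b, c, t, h, w = frames.shape
    out = frames.clone()
    cut = int(w * (1.0 - keep))
    start = torch.randint(0, w - cut + 1, (1,)).item()
    out[..., start:start + cut] = 0.0
    return out

def self_distill_step(teacher, student, frames, optimizer):
    """One training step: align the student's features on the occluded view
    with the teacher's features on the full view (same audio-visual pair)."""
    with torch.no_grad():
        target = teacher(frames)                    # features of fully visible video
    pred = student(simulate_partial_visibility(frames))
    loss = 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Here the zeroed band stands in for camera framing that leaves the sounding object partly off screen; any feature-alignment objective (cosine, L2) could play the same role in this reading.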

@article{huang2025_2507.02271,
  title={Spotlighting Partially Visible Cinematic Language for Video-to-Audio Generation via Self-distillation},
  author={Feizhen Huang and Yu Wu and Yutian Lin and Bo Du},
  journal={arXiv preprint arXiv:2507.02271},
  year={2025}
}