Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion

21 March 2025
Yu Sun
Yin Li
Ruixiao Sun
Chunhui Liu
Fangming Zhou
Ze Jin
Linjie Wang
Xiang Shen
Zhuolin Hao
Hongyu Xiong
Abstract

Transformer-based multimodal models are widely used in industrial-scale recommendation, search, and advertising systems for content understanding and relevance ranking. Enhancing labeled training data quality and cross-modal fusion significantly improves model performance, influencing key metrics such as quality view rates and ad revenue. High-quality annotations are crucial for advancing content modeling, yet traditional statistical-based active learning (AL) methods face limitations: they struggle to detect overconfident misclassifications and are less effective at distinguishing semantically similar items in deep neural networks. Additionally, audio information plays an increasing role, especially on short-video platforms, yet most pre-trained multimodal architectures focus primarily on text and images. While training from scratch across all three modalities is possible, it sacrifices the benefits of leveraging existing pre-trained visual-language (VL) and audio models. To address these challenges, we propose kNN-based Latent Space Broadening (LSB) to enhance AL efficiency and Vision-Language Modeling with Audio Enhancement (VLMAE), a mid-fusion approach integrating audio into VL models. This system is deployed in production systems, leading to significant business gains.
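The abstract only outlines the idea behind kNN-based Latent Space Broadening: rather than selecting annotation candidates by prediction confidence alone, the pool is broadened with latent-space neighbors of uncertain samples, which can surface overconfident misclassifications and near-duplicate items. The sketch below is a minimal illustration of that selection loop under assumed details; the function name `lsb_select`, the confidence threshold, and the choice of Euclidean distance are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of kNN-based Latent Space Broadening (LSB) for active learning.
# Assumptions (not from the paper): uncertainty is measured by max softmax score,
# neighbors are found by Euclidean distance in the model's latent space, and the
# names lsb_select / uncertainty_threshold / k are hypothetical.
import numpy as np


def lsb_select(embeddings: np.ndarray,
               confidences: np.ndarray,
               uncertainty_threshold: float = 0.6,
               k: int = 5) -> np.ndarray:
    """Return indices of samples to route to human annotation.

    embeddings  : (N, D) latent vectors from the deployed multimodal model.
    confidences : (N,) max softmax score per sample.
    """
    # 1. Seed set: classic uncertainty sampling (low-confidence predictions).
    seeds = np.where(confidences < uncertainty_threshold)[0]

    # 2. Broadening: add the k nearest latent-space neighbors of each seed,
    #    which can catch overconfident misclassifications that pure
    #    confidence-based selection would miss.
    selected = set(seeds.tolist())
    for i in seeds:
        dists = np.linalg.norm(embeddings - embeddings[i], axis=1)
        neighbors = np.argsort(dists)[1:k + 1]  # skip the seed itself
        selected.update(int(j) for j in neighbors)
    return np.array(sorted(selected))


# Purely illustrative usage on random data:
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(1000, 128))
    conf = rng.uniform(size=1000)
    print(lsb_select(emb, conf).shape)
```

In a production pipeline the loop over seeds would typically be replaced with an approximate nearest-neighbor index, but the broadening principle is the same: uncertain seeds plus their latent neighbors form the annotation batch.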

@article{sun2025_2503.17551,
  title={Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion},
  author={Yu Sun and Yin Li and Ruixiao Sun and Chunhui Liu and Fangming Zhou and Ze Jin and Linjie Wang and Xiang Shen and Zhuolin Hao and Hongyu Xiong},
  journal={arXiv preprint arXiv:2503.17551},
  year={2025}
}