Improving Sound Source Localization with Joint Slot Attention on Image and Audio

21 April 2025
Inho Kim
Youngkil Song
Jicheol Park
Won Hwa Kim
Suha Kwak
Abstract

Sound source localization (SSL) is the task of locating the source of sound within an image. Due to the lack of localization labels, the de facto standard in SSL has been to represent an image and an audio clip as a single embedding vector each, and to use these embeddings to learn SSL via contrastive learning. To this end, previous work samples one of the local image features as the image embedding and aggregates all local audio features to obtain the audio embedding, which is far from optimal because the input contains noise and background irrelevant to the actual target. We present a novel SSL method that addresses this long-standing issue through joint slot attention on image and audio. Specifically, two slots competitively attend to image and audio features to decompose them into target and off-target representations, and only the target representations of image and audio are used for contrastive learning. We also introduce cross-modal attention matching to further align the local features of image and audio. Our method achieved the best results in almost all settings on three public benchmarks for SSL, and substantially outperformed all prior work in cross-modal retrieval.
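
To make the idea of two competing slots more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a heavily simplified slot-attention update (no learned projections, no recurrent refinement), applies it to one modality at a time rather than jointly, treats slot 0 as the "target" slot by convention, and uses a one-directional InfoNCE loss over target slots. All function names here are hypothetical, and cross-modal attention matching is not covered.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def slot_attention_step(slots, features):
    """One simplified slot-attention update (assumption: no learned weights).

    slots:    list of K slot vectors (K = 2 here: target and off-target)
    features: list of N local feature vectors from one modality
    Returns (new_slots, attn), where attn[i][k] is feature i's weight on slot k.
    """
    dim = len(slots[0])
    scale = 1.0 / math.sqrt(dim)
    # Softmax over slots for each feature: the slots compete for every feature.
    attn = [softmax([scale * dot(f, s) for s in slots]) for f in features]
    new_slots = []
    for k in range(len(slots)):
        weights = [row[k] for row in attn]
        total = sum(weights) + 1e-8
        # Each slot becomes the attention-weighted mean of the features.
        new_slots.append(
            [sum(w * f[d] for w, f in zip(weights, features)) / total
             for d in range(dim)]
        )
    return new_slots, attn


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + 1e-8)


def info_nce(image_targets, audio_targets, temperature=0.1):
    """Image-to-audio InfoNCE over a batch of target-slot embeddings.

    Pair i (image i, audio i) is the positive; all other audio clips in the
    batch act as negatives.
    """
    losses = []
    for i, img in enumerate(image_targets):
        logits = [cosine(img, aud) / temperature for aud in audio_targets]
        losses.append(-math.log(softmax(logits)[i]))
    return sum(losses) / len(losses)
```

In this sketch, only the target slot of each modality would be passed to `info_nce`, while the off-target slot absorbs background and noise and is left out of the contrastive loss.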

@article{kim2025_2504.15118,
  title={Improving Sound Source Localization with Joint Slot Attention on Image and Audio},
  author={Inho Kim and Youngkil Song and Jicheol Park and Won Hwa Kim and Suha Kwak},
  journal={arXiv preprint arXiv:2504.15118},
  year={2025}
}