SOUPLE: Enhancing Audio-Visual Localization and Segmentation with Learnable Prompt Contexts

Khanh Binh Nguyen
Chae Jung Park
Main: 8 Pages
5 Figures
Bibliography: 2 Pages
8 Tables
Abstract

Large-scale pre-trained image-text models exhibit robust multimodal representations, yet applying the Contrastive Language-Image Pre-training (CLIP) model to audio-visual localization remains challenging. Simply replacing the classification token ([CLS]) with an audio-embedded token ([V_A]) struggles to capture semantic cues, and the fixed prompt "a photo of a [V_A]" fails to establish meaningful connections between the audio embedding and the context tokens. To address these issues, we propose Sound-aware Prompt Learning (SOUPLE), which replaces the fixed prompt with learnable context tokens. These tokens incorporate visual features to generate a conditional context for a mask decoder, effectively bridging the semantic correspondence between audio and visual inputs. Experiments on VGGSound, SoundNet, and AVSBench demonstrate that SOUPLE improves localization and segmentation performance.
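To make the idea concrete, below is a minimal sketch of how learnable, visually conditioned prompt contexts could be combined with an audio-embedded token. It assumes a CoCoOp-style setup in which a small meta-network shifts shared context tokens using a pooled image feature before the [V_A] token is appended; all class names, dimensions, and the `meta_net` design are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class SoundAwarePrompt(nn.Module):
    """Sketch of learnable prompt contexts conditioned on visual features.

    Replaces the fixed prompt "a photo of a [V_A]" with n_ctx learnable context
    tokens; a small meta-network injects the pooled image feature into each
    context token before the audio-embedded token [V_A] is appended.
    """

    def __init__(self, n_ctx: int = 8, dim: int = 512):
        super().__init__()
        # Learnable context tokens shared across samples (replace the fixed prompt).
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # Meta-net: maps a visual feature to a per-sample shift of the context.
        self.meta_net = nn.Sequential(
            nn.Linear(dim, dim // 16), nn.ReLU(), nn.Linear(dim // 16, dim)
        )

    def forward(self, visual_feat: torch.Tensor, audio_token: torch.Tensor) -> torch.Tensor:
        """visual_feat: (B, dim) pooled image feature; audio_token: (B, dim) [V_A] embedding.

        Returns prompt embeddings of shape (B, n_ctx + 1, dim) to be consumed by
        a text encoder / mask decoder.
        """
        bias = self.meta_net(visual_feat).unsqueeze(1)           # (B, 1, dim)
        ctx = self.ctx.unsqueeze(0) + bias                       # (B, n_ctx, dim)
        prompt = torch.cat([ctx, audio_token.unsqueeze(1)], 1)   # append [V_A]
        return prompt


if __name__ == "__main__":
    model = SoundAwarePrompt()
    v = torch.randn(2, 512)   # e.g., pooled features from CLIP's visual encoder
    a = torch.randn(2, 512)   # audio-embedded tokens [V_A]
    print(model(v, a).shape)  # torch.Size([2, 9, 512])
```

The conditional bias is what distinguishes this from a static learned prompt: each sample's context tokens are nudged by its own visual feature, so the resulting prompt can adapt to the scene before being matched against audio.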
