TTAVLM: Benchmarking the Robustness of Audio-Visual Recognition Models at Test-Time

While recent audio-visual models have demonstrated impressive performance, their robustness to distributional shifts at test time remains insufficiently understood. Existing robustness benchmarks focus mainly on single modalities, making them inadequate for thoroughly assessing the robustness of audio-visual models. Motivated by real-world scenarios in which shifts can occur in both the audio and visual modalities, we introduce TTAVLM, a comprehensive benchmark designed to evaluate the test-time robustness of audio-visual recognition models. TTAVLM comprises four audio-visual benchmark datasets, each incorporating 75 bimodal audio-visual corruptions. Through extensive evaluations, we observe that the robustness of state-of-the-art supervised and self-supervised audio-visual models declines as corruption severity increases. Furthermore, online test-time adaptation (TTA) methods offer only minimal performance improvements under bimodal corruptions. We therefore propose a simple TTA approach that enables on-the-fly cross-modal fusion by penalizing high-entropy samples, and it achieves improvements on our benchmark. We hope that TTAVLM will steer the development of more effective and robust audio-visual TTA approaches. Our code is available.
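To make the entropy-penalization idea concrete, below is a minimal PyTorch sketch of an entropy-minimizing test-time adaptation step over fused audio-visual logits, in the spirit of TENT-style TTA. This is an illustration under stated assumptions, not the paper's implementation: `audio_model`, `visual_model`, the fixed fusion weight `alpha`, and `collect_norm_params` are all hypothetical, and a fixed `alpha` does not reproduce the paper's on-the-fly cross-modal fusion.

```python
import torch
import torch.nn.functional as F


def softmax_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-sample Shannon entropy of the softmax distribution."""
    log_probs = logits.log_softmax(dim=-1)
    return -(log_probs.exp() * log_probs).sum(dim=-1)


def collect_norm_params(model: torch.nn.Module) -> list:
    """Gather normalization-layer affine parameters, a common choice
    of what to adapt at test time (illustrative, not the paper's setup)."""
    params = []
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d,
                          torch.nn.BatchNorm3d, torch.nn.LayerNorm)):
            for p in (m.weight, m.bias):
                if p is not None:
                    params.append(p)
    return params


@torch.enable_grad()
def tta_step(audio_model, visual_model, audio, frames, optimizer, alpha=0.5):
    """One online TTA step: fuse per-modality logits with a fixed weight
    `alpha` (an assumption) and minimize the entropy of the fused
    prediction, i.e. penalize high-entropy samples."""
    logits_a = audio_model(audio)    # (B, num_classes)
    logits_v = visual_model(frames)  # (B, num_classes)
    fused = alpha * logits_a + (1.0 - alpha) * logits_v

    loss = softmax_entropy(fused).mean()  # entropy penalty on fused output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return fused.detach()


# Hypothetical usage on a corrupted test stream:
# params = collect_norm_params(audio_model) + collect_norm_params(visual_model)
# optimizer = torch.optim.SGD(params, lr=1e-4)
# for audio, frames in corrupted_test_stream:
#     preds = tta_step(audio_model, visual_model, audio, frames, optimizer)
```

Restricting updates to normalization-layer parameters is a common TTA design choice because it keeps the per-batch update lightweight and comparatively stable on streaming, unlabeled test data.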