Listenable Maps for Zero-Shot Audio Classifiers

Interpreting the decisions of deep learning models, including audio classifiers, is crucial for ensuring the transparency and trustworthiness of this technology. In this paper, we introduce LMAC-ZS (Listenable Maps for Audio Classifiers in the Zero-Shot context), which, to the best of our knowledge, is the first decoder-based post-hoc interpretation method for explaining the decisions of zero-shot audio classifiers. The proposed method utilizes a novel loss function that maximizes faithfulness to the original similarity between a given text-audio pair. We provide an extensive evaluation using the Contrastive Language-Audio Pretraining (CLAP) model to showcase that our interpreter remains faithful to the decisions in a zero-shot classification context. Moreover, we qualitatively show that our method produces meaningful explanations that correlate well with different text prompts.
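The zero-shot setting the paper builds on can be illustrated with a minimal sketch: a CLAP-style classifier scores an audio clip against a set of text prompts by cosine similarity between their embeddings and picks the most similar prompt. The embeddings, labels, and function below are placeholders for illustration, not outputs of a real CLAP model or the paper's actual code.

```python
import numpy as np

def zero_shot_classify(audio_emb, text_embs, labels):
    """Pick the label whose text embedding is most similar to the audio embedding.

    Sketch of CLAP-style zero-shot classification; all embeddings here are
    synthetic placeholders, not produced by an actual audio/text encoder.
    """
    a = audio_emb / np.linalg.norm(audio_emb)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = t @ a  # cosine similarity between each prompt and the audio clip
    return labels[int(np.argmax(sims))], sims

# Toy example: random placeholder embeddings for three candidate prompts.
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 8))
labels = ["dog bark", "siren", "speech"]
# Make the "audio" embedding lie close to the "siren" prompt embedding.
audio_emb = text_embs[1] + 0.1 * rng.normal(size=8)
pred, sims = zero_shot_classify(audio_emb, text_embs, labels)
```

An interpretation method in this context must explain why the similarity between a particular audio clip and a particular prompt is high, which is the pairing the LMAC-ZS loss is designed to stay faithful to.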
@article{paissan2025_2405.17615,
  title   = {Listenable Maps for Zero-Shot Audio Classifiers},
  author  = {Francesco Paissan and Luca Della Libera and Mirco Ravanelli and Cem Subakan},
  journal = {arXiv preprint arXiv:2405.17615},
  year    = {2025}
}