Open-vocabulary keyword spotting in any language through multilingual
contrastive speech-phoneme pretraining
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Main: 8 pages · Appendix: 8 pages · Bibliography: 7 pages · 3 figures · 12 tables
Abstract
In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic transcriptions, encompassing more than 115 languages from diverse language families. Based on this multilingual dataset, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between speech signals and phonemically transcribed keywords or arbitrary phrases. The proposed model has been tested on two fieldwork speech corpora covering 97 unseen languages, exhibiting strong generalizability across languages. A comparison with a text-based model shows that using phonemes as modeling units enables much better crosslinguistic generalization than orthographic text.
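The abstract describes open-vocabulary keyword spotting via contrastive speech-phoneme embeddings. The paper's own architecture and scoring details are not given here, so the following is only a minimal, generic sketch of how inference with such a model could work: speech and phoneme-sequence embeddings are compared by cosine similarity, and the best-scoring keyword above a threshold is reported. All names (`cosine`, `spot_keyword`) and the toy 2-d embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def spot_keyword(speech_emb, keyword_embs, threshold=0.5):
    """Return (keyword, score) for the best-matching phonemic keyword,
    or None if no keyword clears the similarity threshold.

    speech_emb:   embedding of the speech segment (hypothetical encoder output)
    keyword_embs: dict mapping keyword (e.g. an IPA string) -> embedding
    """
    best_name, best_score = None, -1.0
    for name, emb in keyword_embs.items():
        score = cosine(speech_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else None

# Toy example with hand-made 2-d "embeddings" (purely illustrative):
speech = [1.0, 0.1]
keywords = {"kæt": [0.9, 0.1], "dɔg": [0.0, 1.0]}
print(spot_keyword(speech, keywords))  # best match is "kæt"
```

Because the keyword set is just a dictionary of embedded phoneme strings, this style of matching is open-vocabulary: any new keyword can be embedded and searched without retraining.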
