Open-vocabulary keyword spotting in any language through multilingual
contrastive speech-phoneme pretraining
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Main: 8 pages · Appendix: 8 pages · Bibliography: 7 pages · 3 figures · 12 tables
Abstract
In this paper, we introduce a massively multilingual speech corpus with fine-grained phonemic transcriptions, encompassing more than 115 languages from diverse language families. Based on this multilingual dataset, we propose CLAP-IPA, a multilingual phoneme-speech contrastive embedding model capable of open-vocabulary matching between speech signals and phonemically transcribed keywords or arbitrary phrases. The proposed model has been tested on two fieldwork speech corpora covering 97 unseen languages, exhibiting strong generalizability across languages. A comparison with a text-based model shows that using phonemes as modeling units enables much better crosslinguistic generalization than orthographic text.
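The abstract describes open-vocabulary keyword spotting via contrastive speech-phoneme embeddings. The paper's own architecture and scoring details are not given here, so the following is only a minimal, generic sketch of how inference with such a model could work: speech and phoneme-sequence embeddings are compared by cosine similarity, and the best-scoring keyword above a threshold is reported. All names (`cosine`, `spot_keyword`) and the toy 2-d embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def spot_keyword(speech_emb, keyword_embs, threshold=0.5):
    """Return (keyword, score) for the best-matching phonemic keyword,
    or None if no keyword clears the similarity threshold.

    speech_emb:   embedding of the speech segment (hypothetical encoder output)
    keyword_embs: dict mapping keyword (e.g. an IPA string) -> embedding
    """
    best_name, best_score = None, -1.0
    for name, emb in keyword_embs.items():
        score = cosine(speech_emb, emb)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else None

# Toy example with hand-made 2-d "embeddings" (purely illustrative):
speech = [1.0, 0.1]
keywords = {"kæt": [0.9, 0.1], "dɔg": [0.0, 1.0]}
print(spot_keyword(speech, keywords))  # best match is "kæt"
```

Because the keyword set is just a dictionary of embedded phoneme strings, this style of matching is open-vocabulary: any new keyword can be embedded and searched without retraining.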
