Towards Learning a Universal Non-Semantic Representation of Speech
- SSL
The ultimate goal of transfer learning is to reduce labeled data requirements by exploiting a pre-existing embedding model trained for different datasets or tasks. The visual and language communities have established benchmarks to compare embeddings, but the speech community has yet to do so. This paper proposes a benchmark for comparing speech representations on non-semantic tasks, and proposes a representation based on an unsupervised triplet-loss objective. The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks. The embedding is trained on a publicly available dataset, and it is tested on a variety of low-resource downstream tasks, including personalization tasks and the medical domain. The benchmark, models, and evaluation code are publicly released.
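The abstract mentions an unsupervised triplet-loss objective for learning the representation. As a rough illustration only (not the paper's implementation), a minimal hinge-style triplet loss over anchor/positive/negative embeddings might look like the sketch below; the margin value, squared-L2 distance, and the sampling comment are assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull anchor toward positive, push away from negative.

    anchor, positive, negative: arrays of shape (batch, embedding_dim).
    margin: assumed hyperparameter; the paper's actual value is not given here.
    """
    pos_dist = np.sum((anchor - positive) ** 2, axis=1)  # squared L2 distance to positive
    neg_dist = np.sum((anchor - negative) ** 2, axis=1)  # squared L2 distance to negative
    return np.mean(np.maximum(pos_dist - neg_dist + margin, 0.0))

# Example with random embeddings. In a self-supervised audio setup, "positive"
# segments might be sampled near the anchor in time and "negatives" from
# elsewhere (an assumption, not a detail stated in the abstract).
rng = np.random.default_rng(0)
a, p, n = (rng.normal(size=(4, 128)) for _ in range(3))
print(triplet_loss(a, p, n))
```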