
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

Abstract

Self-supervised learning (SSL) has achieved great success in speech recognition, while limited exploration has been attempted for other speech processing tasks. As the speech signal contains multi-faceted information including speaker identity, paralinguistics, spoken content, etc., learning universal representations for all speech tasks is challenging. In this paper, we propose a new pre-trained model, WavLM, to solve full-stack downstream speech tasks. WavLM extends the HuBERT framework to denoising masked speech modeling, where the target is to predict the pseudo-labels of the original speech on masked regions, given simulated noisy speech as input. The simulated speech is created by adding noise or speech from other utterances to the original speech. The denoising masked speech modeling task aims to improve the model's robustness to complex acoustic environments and its preservation of speaker identity. We scale up the training dataset from 60k hours to 94k hours. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark, and brings significant improvements for various speech processing tasks on their representative benchmarks. The code and pre-trained models are available at https://aka.ms/wavlm.
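As an illustration of the data-simulation idea described above, the following is a minimal sketch (not the paper's exact recipe): an interference signal, either background noise or a segment of another utterance, is mixed into the clean waveform at a randomly chosen energy ratio, and the resulting noisy/overlapped waveform would serve as the model input while targets are still derived from the clean utterance. The function names, mixing probability, and SNR ranges below are illustrative assumptions, not values taken from WavLM.

```python
import numpy as np


def mix_at_snr(speech, interference, snr_db, rng=np.random):
    """Overlay an interference signal (noise or another utterance) onto `speech`
    at the given signal-to-interference ratio (in dB)."""
    # Tile the interference if it is shorter than the speech, then crop a random segment.
    if len(interference) < len(speech):
        reps = int(np.ceil(len(speech) / len(interference)))
        interference = np.tile(interference, reps)
    start = rng.randint(0, len(interference) - len(speech) + 1)
    interference = interference[start:start + len(speech)]

    # Scale the interference so that speech power / interference power matches snr_db.
    speech_power = np.mean(speech ** 2) + 1e-10
    interf_power = np.mean(interference ** 2) + 1e-10
    scale = np.sqrt(speech_power / (interf_power * 10 ** (snr_db / 10)))
    return speech + scale * interference


def simulate_noisy_utterance(speech, noise_pool, utterance_pool, rng=np.random):
    """Randomly corrupt a clean utterance with either background noise or a
    segment of another utterance (hypothetical settings for illustration)."""
    if rng.rand() < 0.5:  # assumed 50/50 choice between noise and overlapped speech
        interference = noise_pool[rng.randint(len(noise_pool))]
        snr_db = rng.uniform(-5, 20)   # assumed noise SNR range
    else:
        interference = utterance_pool[rng.randint(len(utterance_pool))]
        snr_db = rng.uniform(-5, 5)    # assumed overlap energy-ratio range
    return mix_at_snr(speech, interference, snr_db, rng)
```

In such a setup, the denoising objective comes from the mismatch between the corrupted input and the clean-speech targets, which is what encourages robustness to noise and overlapping speakers.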
