
ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis

International Symposium on Chinese Spoken Language Processing (ISCSLP), 2022
Abstract

In recent years, neural network-based models for multi-speaker text-to-speech synthesis (TTS) have made significant progress. However, the speaker encoders used in current methods cannot capture enough speaker information. In this paper, we propose an end-to-end method that generates high-quality speech with better speaker similarity for both seen and unseen speakers by introducing a more powerful speaker encoder. The method consists of three separately trained components: a speaker encoder based on ECAPA-TDNN, a state-of-the-art TDNN-based model from the speaker verification task; a FastSpeech2-based synthesizer; and a HiFi-GAN vocoder. By comparing different speaker encoder models, we show that our proposed method achieves better naturalness and similarity on both seen and unseen test sets. To evaluate our synthesized speech efficiently, we are the first to adopt deep-learning-based automatic MOS evaluation methods to assess our results, and these methods show great potential for automatic speech quality assessment.
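As a minimal sketch of the speaker-embedding stage of this three-component pipeline, the snippet below extracts an ECAPA-TDNN speaker embedding using SpeechBrain's publicly available pretrained model (speechbrain/spkrec-ecapa-voxceleb). This is an illustration under assumed tooling, not the authors' code; the reference file path is hypothetical, and the synthesizer/vocoder conditioning is indicated only in comments.

```python
# Sketch: extract an ECAPA-TDNN speaker embedding with SpeechBrain.
# Assumptions: SpeechBrain's pretrained spkrec-ecapa-voxceleb model,
# a hypothetical reference utterance "reference_speaker.wav".
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load a pretrained ECAPA-TDNN speaker encoder (trained on VoxCeleb).
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/spkrec-ecapa-voxceleb",
)

# Reference utterance from the target speaker (seen or unseen).
wav, sr = torchaudio.load("reference_speaker.wav")  # hypothetical path
if sr != 16000:  # this pretrained model expects 16 kHz input
    wav = torchaudio.functional.resample(wav, sr, 16000)

# encode_batch returns a (batch, 1, 192) tensor; squeeze to (1, 192).
spk_emb = encoder.encode_batch(wav).squeeze(1)

# In the paper's pipeline, an embedding like this would condition the
# FastSpeech2 synthesizer, and the predicted mel-spectrogram would be
# converted to a waveform by the HiFi-GAN vocoder.
```

Because the three components are trained separately, the speaker encoder can be swapped (e.g., for comparison against other encoder models, as done in the paper) without retraining the vocoder.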
