SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection

Synthesizing the voices of unseen speakers remains a persistent challenge in multi-speaker text-to-speech (TTS). Existing methods model speaker characteristics through speaker conditioning during training, which increases model complexity and limits reproducibility and accessibility. A lower-complexity method would open speech synthesis research to a wider audience working with limited computational and data resources. To this end, we propose SelectTTS, a simple and effective alternative. SelectTTS selects appropriate frames from the target speaker's speech and decodes them using frame-level self-supervised learning (SSL) features. We demonstrate that this approach effectively captures speaker characteristics for unseen speakers and achieves performance comparable to state-of-the-art multi-speaker TTS frameworks on both objective and subjective metrics. By directly selecting frames from the target speaker's speech, SelectTTS enables generalization to unseen speakers with significantly lower model complexity. Compared to baselines such as XTTS-v2 and VALL-E, SelectTTS achieves better speaker similarity while reducing model parameters by over 8x and training data requirements by 270x.
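The abstract describes the core mechanism only at a high level: predict a discrete unit sequence from text, then fill each unit with a matching frame from the target speaker's reference speech and decode the resulting SSL feature sequence. The sketch below illustrates one plausible reading of that frame-selection step; every name in it (select_frames, target_units, the numeric fallback for unmatched units) is a hypothetical assumption for illustration, not the authors' actual implementation.

import numpy as np

# Assumed inputs (shapes are illustrative):
# target_feats: (T, D) frame-level SSL features from the target speaker's
#               reference speech (e.g., HuBERT/WavLM hidden states).
# target_units: (T,)  discrete unit ID per reference frame (e.g., the
#               k-means cluster index of each SSL feature).
# pred_units:   (N,)  unit sequence predicted from the input text.

def select_frames(pred_units, target_units, target_feats, rng=None):
    """For each predicted unit, pick a reference frame from the target
    speaker whose unit ID matches; back off to the numerically nearest
    unit when no exact match exists (a crude stand-in fallback)."""
    rng = rng or np.random.default_rng(0)
    # Index reference frames by unit ID for O(1) candidate lookup.
    frames_by_unit = {}
    for idx, u in enumerate(target_units):
        frames_by_unit.setdefault(int(u), []).append(idx)
    selected = []
    for u in pred_units:
        candidates = frames_by_unit.get(int(u))
        if candidates is None:
            # Unit absent from the reference speech: fall back to the
            # closest available unit ID.
            nearest = min(frames_by_unit, key=lambda k: abs(k - int(u)))
            candidates = frames_by_unit[nearest]
        selected.append(int(rng.choice(candidates)))
    # Stack the chosen frames' SSL features; a vocoder trained to invert
    # frame-level SSL features would then decode this sequence to audio.
    return target_feats[np.array(selected)]

Because the output is assembled from the target speaker's own frames rather than generated under a learned speaker embedding, speaker identity is preserved by construction, which is consistent with the abstract's claim of strong speaker similarity at low model complexity.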
@article{ulgen2025_2408.17432,
  title   = {SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection},
  author  = {Ismail Rasim Ulgen and Shreeram Suresh Chandra and Junchen Lu and Berrak Sisman},
  journal = {arXiv preprint arXiv:2408.17432},
  year    = {2025}
}