On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study

Modern text-to-speech (TTS) systems designed for conversation produce high-quality utterances, but they often remain unavailable to the public. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and how they handle conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically compare two approaches: context-based utterance-level training and full conversation training. Results show that context-based utterance training achieves a higher Mean Opinion Score (MOS; 4.3/5.0 vs. 3.7/5.0) and reduces training time by 37%, while full conversation training suffers from speaker-similarity hallucinations. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.
@article{liu2025_2505.07202,
  title={On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study},
  author={Hyouin Liu and Zhikuan Zhang},
  journal={arXiv preprint arXiv:2505.07202},
  year={2025}
}