EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting

Human speech goes beyond the mere transfer of information; it is a profound exchange of emotions and a connection between individuals. While Text-to-Speech (TTS) models have made huge progress, they still face challenges in controlling the emotional expression of the generated speech. In this work, we propose EmoVoice, a novel emotion-controllable TTS model that exploits large language models (LLMs) to enable fine-grained, freestyle natural-language emotion control. Inspired by chain-of-thought (CoT) and chain-of-modality (CoM) techniques, we further design a phoneme-boost variant that outputs phoneme tokens and audio tokens in parallel to enhance content consistency. In addition, we introduce EmoVoice-DB, a high-quality 40-hour English emotion dataset featuring expressive speech and fine-grained emotion labels with natural-language descriptions. EmoVoice achieves state-of-the-art performance on the English EmoVoice-DB test set using only synthetic training data, and on the Chinese Secap test set using our in-house data. We further investigate the reliability of existing emotion evaluation metrics and their alignment with human perceptual preferences, and explore using state-of-the-art multimodal LLMs, GPT-4o-audio and Gemini, to assess emotional speech. Demo samples are available at this https URL. The dataset, code, and checkpoints will be released.
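
To make the phoneme-boost idea concrete, here is a minimal sketch (not the released implementation) of how a decoder could emit phoneme tokens and audio tokens in parallel from the same hidden state; the class name, hidden size, and vocabulary sizes are illustrative assumptions.

```python
# Hypothetical sketch: two output heads predict a phoneme token and an audio
# token in parallel from the shared LLM hidden state, so the phoneme stream
# can reinforce the content consistency of the audio stream.
import torch
import torch.nn as nn

class ParallelPhonemeAudioHead(nn.Module):
    def __init__(self, hidden_size=1024, phoneme_vocab=128, audio_vocab=4096):
        super().__init__()
        self.phoneme_head = nn.Linear(hidden_size, phoneme_vocab)  # phoneme logits
        self.audio_head = nn.Linear(hidden_size, audio_vocab)      # audio-codec token logits

    def forward(self, hidden_states):
        # hidden_states: (batch, seq_len, hidden_size) from the LLM backbone
        return self.phoneme_head(hidden_states), self.audio_head(hidden_states)

# Usage: at each decoding step both token streams are predicted jointly.
head = ParallelPhonemeAudioHead()
h = torch.randn(2, 10, 1024)
phoneme_logits, audio_logits = head(h)
print(phoneme_logits.shape, audio_logits.shape)  # (2, 10, 128) (2, 10, 4096)
```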
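For the LLM-based emotion assessment, one possible setup is to pass the synthesized clip together with the target emotion description to an audio-capable chat model and ask for a rating. The sketch below assumes OpenAI's gpt-4o-audio-preview chat interface; the prompt wording and scoring scale are illustrative, not the paper's exact evaluation protocol.

```python
# Hypothetical sketch of scoring a synthesized clip's emotional fit with an
# audio-capable chat model. File name and prompt are placeholders.
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

with open("sample.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],
    messages=[{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Target emotion: 'a trembling, fearful whisper'. "
                     "Rate how well the speech matches this emotion on a "
                     "1-5 scale and briefly justify the score."},
            {"type": "input_audio",
             "input_audio": {"data": audio_b64, "format": "wav"}},
        ],
    }],
)
print(response.choices[0].message.content)
```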
@article{yang2025_2504.12867,
  title={EmoVoice: LLM-based Emotional Text-To-Speech Model with Freestyle Text Prompting},
  author={Guanrou Yang and Chen Yang and Qian Chen and Ziyang Ma and Wenxi Chen and Wen Wang and Tianrui Wang and Yifan Yang and Zhikang Niu and Wenrui Liu and Fan Yu and Zhihao Du and Zhifu Gao and ShiLiang Zhang and Xie Chen},
  journal={arXiv preprint arXiv:2504.12867},
  year={2025}
}