Optimizing Multilingual Text-To-Speech with Accents & Emotions

Main: 9 pages
12 figures
Bibliography: 3 pages
Abstract

While state-of-the-art text-to-speech (TTS) systems achieve high naturalness in monolingual settings, synthesizing speech with correct multilingual accents (especially for Indic languages) and context-appropriate emotions remains difficult, because current frameworks do not capture cultural nuance. This paper introduces a new TTS architecture that integrates accent handling and transliteration preservation with multi-scale emotion modelling, tuned in particular for Hindi and the Indian English accent. Our approach extends the Parler-TTS model with a language-specific phoneme-alignment hybrid encoder-decoder architecture, culture-sensitive emotion embedding layers trained on native-speaker corpora, and dynamic accent code switching based on residual vector quantization. Quantitative tests demonstrate a 23.7% improvement in accent accuracy (word error rate reduced from 15.4% to 11.8%) and 85.3% emotion-recognition accuracy among native listeners, surpassing the METTS and VECL-TTS baselines. The novelty of the system is its real-time code-mixing: it generates utterances such as "Namaste, let's talk about <Hindi phrase>" with seamless accent shifts while preserving emotional consistency. A subjective evaluation with 200 users yielded a mean opinion score (MOS) of 4.2/5 for cultural correctness, significantly better than existing multilingual systems (p<0.01). This work brings practical cross-lingual synthesis closer by demonstrating scalable accent-emotion disentanglement, with direct applications in South Asian EdTech and accessibility software.
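
The abstract names two conditioning ingredients: separate accent and emotion embeddings, and residual vector quantization (RVQ) of the accent code so it can be switched mid-utterance. The sketch below is a minimal, hypothetical illustration of how such pieces could be combined per text segment; it is not the authors' implementation, and all module names, dimensions, and the per-segment accent-id scheme (ResidualVectorQuantizer, AccentEmotionConditioner, 64-dimensional embeddings) are assumptions made for illustration only.

# Hypothetical sketch of accent-emotion conditioning with RVQ-quantized,
# per-segment accent codes; not the paper's actual implementation.
import torch
import torch.nn as nn


class ResidualVectorQuantizer(nn.Module):
    """Quantize a continuous accent vector with a stack of small codebooks."""

    def __init__(self, dim: int = 64, codebook_size: int = 32, num_stages: int = 4):
        super().__init__()
        self.codebooks = nn.ParameterList(
            [nn.Parameter(torch.randn(codebook_size, dim)) for _ in range(num_stages)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim). Each stage quantizes the residual left by the previous one.
        residual, quantized = x, torch.zeros_like(x)
        for codebook in self.codebooks:
            dists = torch.cdist(residual, codebook)     # (batch, codebook_size)
            codes = codebook[dists.argmin(dim=-1)]      # nearest entry per item
            quantized = quantized + codes
            residual = residual - codes
        return quantized


class AccentEmotionConditioner(nn.Module):
    """Produce one conditioning vector per segment of code-switched text."""

    def __init__(self, dim: int = 64, num_accents: int = 2, num_emotions: int = 5):
        super().__init__()
        self.accent_emb = nn.Embedding(num_accents, dim)    # e.g. Indian English, Hindi
        self.emotion_emb = nn.Embedding(num_emotions, dim)  # e.g. neutral, happy, ...
        self.rvq = ResidualVectorQuantizer(dim)

    def forward(self, accent_ids: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        # accent_ids: (num_segments,), one accent per segment of the utterance;
        # emotion_id: scalar id kept fixed across segments for emotional consistency.
        accent = self.rvq(self.accent_emb(accent_ids))      # quantized, switchable
        emotion = self.emotion_emb(emotion_id).expand_as(accent)
        return accent + emotion                             # (num_segments, dim)


# Example: "Namaste, let's talk about <Hindi phrase>" split into 3 segments with
# Hindi / Indian-English / Hindi accents and a single emotion held throughout.
conditioner = AccentEmotionConditioner()
cond = conditioner(torch.tensor([1, 0, 1]), torch.tensor(2))
print(cond.shape)  # torch.Size([3, 64])

The resulting per-segment vectors would condition the decoder of a Parler-TTS-style model; quantizing only the accent path is what lets the accent flip between segments while the shared emotion embedding keeps the affect constant.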

@article{pawar2025_2506.16310,
  title={Optimizing Multilingual Text-To-Speech with Accents \& Emotions},
  author={Pranav Pawar and Akshansh Dwivedi and Jenish Boricha and Himanshu Gohil and Aditya Dubey},
  journal={arXiv preprint arXiv:2506.16310},
  year={2025}
}