ResearchTrend.AI
GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

15 April 2025
Yaodong Song
Hongjie Chen
Jie Lian
Yuxin Zhang
Guangmin Xia
Zehan Li
Genliang Zhao
Jian Kang
Yongxiang Li
Jie Li
Abstract

While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions among three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs, which limits real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on the top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
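The partial fine-tuning scheme in the speech-generation branch can be sketched in a few lines: freeze the bottom k decoder layers to preserve the LLM's linguistic knowledge, and leave only the top layers trainable for speech token prediction. This is a toy illustration under stated assumptions; `DecoderLayer`, `freeze_bottom_k`, and the layer counts are hypothetical names for exposition, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class DecoderLayer:
    """Stand-in for one transformer decoder layer (hypothetical)."""
    index: int
    trainable: bool = True

def freeze_bottom_k(layers, k):
    """Freeze the bottom k layers; layers at index >= k stay trainable."""
    for layer in layers:
        layer.trainable = layer.index >= k
    return layers

# Toy 12-layer stack: freeze the bottom 8, fine-tune the top 4
# for speech token prediction.
stack = freeze_bottom_k([DecoderLayer(i) for i in range(12)], k=8)
trainable = [layer.index for layer in stack if layer.trainable]
print(trainable)  # → [8, 9, 10, 11]
```

In a real training setup the same idea is typically expressed by disabling gradient updates on the frozen layers' parameters; the sketch only captures the layer-selection logic.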

@article{song2025_2504.12339,
  title={GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture},
  author={Yaodong Song and Hongjie Chen and Jie Lian and Yuxin Zhang and Guangmin Xia and Zehan Li and Genliang Zhao and Jian Kang and Yongxiang Li and Jie Li},
  journal={arXiv preprint arXiv:2504.12339},
  year={2025}
}