ResearchTrend.AI
GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture

15 April 2025
Yaodong Song
Hongjie Chen
Jie Lian
Yuxin Zhang
Guangmin Xia
Zehan Li
Genliang Zhao
Jian Kang
Yongxiang Li
Jie Li
Abstract

While large language models (LLMs) have revolutionized text-to-speech (TTS) synthesis through discrete tokenization paradigms, current architectures exhibit fundamental tensions among three critical dimensions: 1) irreversible loss of acoustic characteristics caused by quantization of speech prompts; 2) stringent dependence on precisely aligned prompt speech-text pairs, which limits real-world deployment; and 3) catastrophic forgetting of the LLM's native text comprehension during optimization for speech token generation. To address these challenges, we propose an LLM-based text-to-speech Generation approach Optimized via a novel dual-branch ArchiTecture (GOAT-TTS). Our framework introduces two key innovations: (1) The modality-alignment branch combines a speech encoder and projector to capture continuous acoustic embeddings, enabling bidirectional correlation between paralinguistic features (language, timbre, emotion) and semantic text representations without transcript dependency; (2) The speech-generation branch employs modular fine-tuning on the top-k layers of an LLM for speech token prediction while freezing the bottom-k layers to preserve foundational linguistic knowledge. Moreover, multi-token prediction is introduced to support real-time streaming TTS synthesis. Experimental results demonstrate that GOAT-TTS achieves performance comparable to state-of-the-art TTS models while validating the efficacy of synthesized dialect speech data.
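The partial fine-tuning scheme in the speech-generation branch can be sketched in a few lines: freeze the bottom k decoder layers to preserve the LLM's linguistic knowledge, and leave only the top layers trainable for speech token prediction. This is a toy illustration under stated assumptions; `DecoderLayer`, `freeze_bottom_k`, and the layer counts are hypothetical names for exposition, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class DecoderLayer:
    """Stand-in for one transformer decoder layer (hypothetical)."""
    index: int
    trainable: bool = True

def freeze_bottom_k(layers, k):
    """Freeze the bottom k layers; layers at index >= k stay trainable."""
    for layer in layers:
        layer.trainable = layer.index >= k
    return layers

# Toy 12-layer stack: freeze the bottom 8, fine-tune the top 4
# for speech token prediction.
stack = freeze_bottom_k([DecoderLayer(i) for i in range(12)], k=8)
trainable = [layer.index for layer in stack if layer.trainable]
print(trainable)  # → [8, 9, 10, 11]
```

In a real training setup the same idea is typically expressed by disabling gradient updates on the frozen layers' parameters; the sketch only captures the layer-selection logic.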

@article{song2025_2504.12339,
  title={GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture},
  author={Yaodong Song and Hongjie Chen and Jie Lian and Yuxin Zhang and Guangmin Xia and Zehan Li and Genliang Zhao and Jian Kang and Yongxiang Li and Jie Li},
  journal={arXiv preprint arXiv:2504.12339},
  year={2025}
}