SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System

29 March 2025
Hyeongju Kim
Jinhyeok Yang
Yechan Yu
Seunghun Ji
Jacob Morton
Frederik Bous
Joon Byun
Juheon Lee
Abstract

We present SupertonicTTS, a novel text-to-speech (TTS) system designed for improved scalability and efficiency in speech synthesis. SupertonicTTS comprises three components: a speech autoencoder for continuous latent representation, a text-to-latent module that leverages flow matching, and an utterance-level duration predictor. To keep the architecture lightweight, we employ a low-dimensional latent space, temporal compression of latents, and ConvNeXt blocks. We further simplify the TTS pipeline by operating directly on raw character-level text and using cross-attention for text-speech alignment, eliminating the need for grapheme-to-phoneme (G2P) modules and external aligners. In addition, we introduce context-sharing batch expansion, which accelerates loss convergence and stabilizes text-speech alignment. Experimental results demonstrate that SupertonicTTS achieves competitive performance while significantly reducing architectural complexity and computational overhead compared to contemporary TTS models. Audio samples demonstrating the capabilities of SupertonicTTS are available at: this https URL.
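The text-to-latent module is trained with a flow-matching objective. As a minimal sketch of how a conditional flow-matching loss over speech latents is typically computed — the model signature, latent shapes, and linear interpolation path here are illustrative assumptions, not the paper's actual implementation:

```python
import torch

def flow_matching_loss(model, z1, cond):
    """Conditional flow-matching loss (sketch).

    z1:   batch of target speech latents, shape (B, C, T)
    cond: text conditioning (e.g. character embeddings); passed
          through to the hypothetical vector-field model.
    """
    t = torch.rand(z1.size(0), 1, 1)        # one random time per sample
    z0 = torch.randn_like(z1)               # noise endpoint of the path
    zt = (1 - t) * z0 + t * z1              # linear interpolation at time t
    target = z1 - z0                        # constant velocity along the path
    pred = model(zt, t, cond)               # predicted vector field
    return ((pred - target) ** 2).mean()    # regress onto the velocity
```

At inference time, such a model would generate latents by integrating the learned vector field from noise to data, conditioned on the text.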

@article{kim2025_2503.23108,
  title={SupertonicTTS: Towards Highly Scalable and Efficient Text-to-Speech System},
  author={Hyeongju Kim and Jinhyeok Yang and Yechan Yu and Seunghun Ji and Jacob Morton and Frederik Bous and Joon Byun and Juheon Lee},
  journal={arXiv preprint arXiv:2503.23108},
  year={2025}
}