Recent advancements in text-to-speech (TTS) models have been driven by the integration of large language models (LLMs), enhancing semantic comprehension and improving speech naturalness. However, existing LLM-based TTS models often lack open-source training code and efficient inference acceleration frameworks, limiting their accessibility and adaptability. Additionally, there is no publicly available TTS model specifically optimized for podcast scenarios, which are in high demand for voice interaction applications. To address these limitations, we introduce Muyan-TTS, an open-source trainable TTS model designed for podcast applications within a $50,000 budget. Our model is pre-trained on over 100,000 hours of podcast audio data, enabling zero-shot TTS synthesis with high-quality voice generation. Furthermore, Muyan-TTS supports speaker adaptation with dozens of minutes of target speech, making it highly customizable for individual voices. In addition to open-sourcing the model, we provide a comprehensive data collection and processing pipeline, a full training procedure, and an optimized inference framework that accelerates LLM-based TTS synthesis. Our code and models are available at this https URL.
@article{li2025_2504.19146,
  title={Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget},
  author={Xin Li and Kaikai Jia and Hao Sun and Jun Dai and Ziyang Jiang},
  journal={arXiv preprint arXiv:2504.19146},
  year={2025}
}