
GLM-TTS Technical Report

Jiayan Cui
Zhihan Yang
Naihan Li
Jiankun Tian
Xingyu Ma
Yi Zhang
Guangyu Chen
Runxuan Yang
Yuqing Cheng
Yizhi Zhou
Guochen Yu
Xiaotao Gu
Jie Tang
Main: 10 Pages
5 Figures
Bibliography: 4 Pages
9 Tables
Abstract

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at this https URL. Real-time speech synthesis demos are provided via this http URL (this http URL) and the Zhipu Qingyan app/web (this http URL).
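To make the GRPO-based multi-reward idea concrete, the following is a minimal sketch of the group-relative advantage step that GRPO is generally known for, applied to several candidate utterances sampled for the same text. It is an illustration under stated assumptions, not the report's implementation: the weighted-sum aggregation, the reward names, the weights, and the helper functions `combine_rewards` and `group_relative_advantages` are all hypothetical, and the abstract does not say how the three rewards are actually combined.

```python
from statistics import mean, pstdev

def combine_rewards(rewards, weights):
    """Collapse per-aspect rewards into one scalar.

    Assumption: a plain weighted sum. The abstract does not specify
    the aggregation used in GLM-TTS.
    """
    return sum(weights[name] * rewards[name] for name in weights)

def group_relative_advantages(scores, eps=1e-8):
    """GRPO-style advantages: standardize each score within its group.

    Each candidate is compared with the other candidates sampled for the
    same prompt, so no separate value model is needed. The population
    standard deviation is an assumption of this sketch.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]

# Hypothetical weights and reward values for three sampled candidates.
weights = {"pronunciation": 1.0, "similarity": 1.0, "prosody": 1.0}
group = [
    {"pronunciation": 0.9, "similarity": 0.8, "prosody": 0.7},
    {"pronunciation": 0.6, "similarity": 0.7, "prosody": 0.5},
    {"pronunciation": 0.8, "similarity": 0.9, "prosody": 0.9},
]

scores = [combine_rewards(r, weights) for r in group]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group mean receive a positive advantage and are reinforced, while those below it are discouraged. A group in which every candidate scores the same yields zero advantages and therefore no learning signal.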
