GLM-TTS Technical Report

16 December 2025

Jiayan Cui

Zhihan Yang

Naihan Li

Jiankun Tian

Xingyu Ma

Yi Zhang

Guangyu Chen

Runxuan Yang

Yuqing Cheng

Yizhi Zhou

Guochen Yu

Xiaotao Gu

Jie Tang


Main: 10 Pages

5 Figures

Bibliography: 4 Pages

9 Tables

Abstract

This work proposes GLM-TTS, a production-level TTS system designed for efficiency, controllability, and high-fidelity speech generation. GLM-TTS follows a two-stage architecture, consisting of a text-to-token autoregressive model and a token-to-waveform diffusion model. With only 100k hours of training data, GLM-TTS achieves state-of-the-art performance on multiple open-source benchmarks. To meet production requirements, GLM-TTS improves speech quality through an optimized speech tokenizer with fundamental frequency constraints and a GRPO-based multi-reward reinforcement learning framework that jointly optimizes pronunciation, speaker similarity, and expressive prosody. In parallel, the system enables efficient and controllable deployment via parameter-efficient LoRA-based voice customization and a hybrid phoneme-text input scheme that provides precise pronunciation control. Our code is available at this https URL. Real-time speech synthesis demos are provided via this http URL (this http URL) and the Zhipu Qingyan app/web (this http URL).
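To make the GRPO-based multi-reward idea concrete, the following is a minimal sketch of how several reward signals could be merged and then normalized within a group of sampled utterances. It is an illustration under stated assumptions, not the authors' implementation: the weighted-sum combination, the weight values, the reward names, and the helper functions `combine_rewards` and `group_relative_advantages` are all hypothetical. Only the group-relative normalization (score minus group mean, divided by group standard deviation) reflects the general GRPO formulation.

```python
from statistics import mean, pstdev


def combine_rewards(rewards, weights):
    """Merge named reward components into one scalar.

    A weighted sum is an assumption made for this sketch; the abstract
    does not state how the individual rewards are combined.
    """
    return sum(weights[name] * value for name, value in rewards.items())


def group_relative_advantages(scores, eps=1e-8):
    """Normalize scores within one sampled group, as GRPO does.

    Each advantage is (score - group mean) / (group std + eps).
    Population standard deviation is used here for simplicity.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical equal weights for the three objectives named in the abstract.
WEIGHTS = {"pronunciation": 1.0, "similarity": 1.0, "prosody": 1.0}

# Made-up reward values for three candidate utterances of the same text.
group = [
    {"pronunciation": 0.90, "similarity": 0.70, "prosody": 0.50},
    {"pronunciation": 0.60, "similarity": 0.80, "prosody": 0.40},
    {"pronunciation": 0.95, "similarity": 0.75, "prosody": 0.70},
]

scores = [combine_rewards(r, WEIGHTS) for r in group]
advantages = group_relative_advantages(scores)
```

Candidates that score above the group average receive a positive advantage and those below receive a negative one, so the policy update needs no separate value model.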
