DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Factorized Discrete Flow Matching

11 September 2025

Ngoc Son Nguyen

Hieu-Nghia Huynh-Nguyen

Thanh V. T. Tran

Truong-Son Hy

Van Nguyen

Main: 7 Pages

6 Figures

Bibliography: 4 Pages

8 Tables

Appendix: 7 Pages

Abstract

This paper introduces DiFlow-TTS, a novel zero-shot text-to-speech (TTS) system that employs discrete flow matching for generative speech modeling. We position this work as an entry point that may facilitate further advances in this research direction. The system leverages factorized speech representations and combines two components: a deterministic Phoneme-Content Mapper that models linguistic content, and a Factorized Discrete Flow Denoiser that jointly models multiple discrete token streams, corresponding to prosody and acoustics, to capture expressive speech attributes. Through extensive empirical evaluation, we analyze both the strengths and limitations of this approach across key aspects, including naturalness, expressive attributes, speaker identity, and inference latency. Experimental results demonstrate that DiFlow-TTS achieves strong performance across multiple metrics while remaining compact and fast: compared with recent state-of-the-art baselines, the model is up to 11.7 times smaller and its inference is up to 34 times faster. Audio samples are available on our demo page: this https URL.
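
To make the idea of jointly modeling several discrete token streams more concrete, the sketch below shows one common way a discrete flow matching setup builds a noisy training input: each token keeps its clean value with probability t and is otherwise replaced by a noise token. This is an illustration under stated assumptions, not the paper's implementation. The abstract does not specify the probability path, so the masked mixture path, the `MASK` id, the `corrupt_streams` helper, and the toy token values are all hypothetical; only the stream names (prosody, acoustic) come from the abstract.

```python
import random

MASK = -1  # hypothetical id for the noise/mask token (assumption)


def corrupt_streams(streams, t, rng):
    """Sample a noisy state x_t for every token stream.

    Assumed mixture path: each token keeps its clean value with
    probability t and is replaced by MASK otherwise, independently
    per token and per stream.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return {
        name: [tok if rng.random() < t else MASK for tok in tokens]
        for name, tokens in streams.items()
    }


# Toy example: two factorized streams of equal length (made-up token ids).
rng = random.Random(0)
clean = {"prosody": [3, 7, 7, 1], "acoustic": [12, 5, 9, 30]}
noisy = corrupt_streams(clean, t=0.5, rng=rng)
```

In a setup like this, a denoiser would be trained to predict the clean tokens of all streams from `noisy` and `t`; at t = 0 every token is noise and at t = 1 the input equals the clean streams.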
