FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching

To advance continuous-valued token modeling and temporal-coherence enforcement, we propose FELLE, an autoregressive model that integrates language modeling with token-wise flow matching. By combining the autoregressive nature of language models with the generative efficacy of flow matching, FELLE predicts continuous-valued tokens (mel-spectrograms). For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step, which improves coherence and stability. Furthermore, to enhance synthesis quality, FELLE introduces a coarse-to-fine flow-matching mechanism that generates continuous-valued tokens hierarchically, conditioned on the language model's output. Experimental results demonstrate the potential of incorporating flow-matching techniques in autoregressive mel-spectrogram modeling, leading to significant improvements in TTS generation quality, as shown in this https URL.
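The idea of a prior informed by the previous step can be illustrated with a small NumPy sketch. This is not the authors' implementation: it assumes the prior is a Gaussian centred on the previous mel frame, a straight-line (linear interpolation) probability path, and an oracle velocity field in place of a trained network. The names `sample_prior`, `flow_matching_pair`, `euler_generate`, the 80-bin frame size, and `sigma` are all hypothetical choices made for illustration.

```python
import numpy as np


def sample_prior(prev_frame, sigma, rng):
    """Draw x0 from a Gaussian centred on the previous frame (assumed prior form)."""
    return prev_frame + sigma * rng.standard_normal(prev_frame.shape)


def flow_matching_pair(x0, x1, t):
    """Return the point at time t on the straight path x0 -> x1 and its target velocity."""
    x_t = (1.0 - t) * x0 + t * x1
    velocity = x1 - x0  # constant along a straight path
    return x_t, velocity


def euler_generate(velocity_fn, x0, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0.copy()
    dt = 1.0 / steps
    for k in range(steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


rng = np.random.default_rng(0)
prev_frame = rng.standard_normal(80)                      # previous mel frame (80 bins assumed)
next_frame = prev_frame + 0.1 * rng.standard_normal(80)   # frame to be predicted

# Training side: a regression pair (x_t, velocity) for a velocity network.
x0 = sample_prior(prev_frame, sigma=0.5, rng=rng)
x_t, velocity = flow_matching_pair(x0, next_frame, t=0.3)

# Inference side: an oracle velocity stands in for the trained network.
generated = euler_generate(lambda x, t: next_frame - x0, x0, steps=8)
```

In an actual model the velocity would be predicted by a network conditioned on the language model's output, and the coarse-to-fine stage described in the abstract would add a further level of conditioning that this sketch leaves out.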
@article{wang2025_2502.11128,
  title={FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching},
  author={Hui Wang and Shujie Liu and Lingwei Meng and Jinyu Li and Yifan Yang and Shiwan Zhao and Haiyang Sun and Yanqing Liu and Haoqin Sun and Jiaming Zhou and Yan Lu and Yong Qin},
  journal={arXiv preprint arXiv:2502.11128},
  year={2025}
}