Foreign accent conversion (FAC) in speech processing remains a challenging task. Building on the remarkable success of large language models (LLMs) in Text-to-Speech (TTS) tasks, this study investigates the adaptation of LLM-based techniques for FAC, a framework we term SpeechAccentLLM. At the core of this framework, we introduce SpeechCodeVAE, the first model to integrate connectionist temporal classification (CTC) directly into codebook discretization for speech content tokenization. This novel architecture generates tokens with a unique "locality" property, as validated by experiments demonstrating favorable trade-offs among content faithfulness, temporal coherence, and structural recoverability. To address data scarcity for the FAC module, we adopt a multitask learning strategy that jointly trains the FAC and TTS modules. Beyond mitigating data limitations, this approach yields faster convergence and higher speech quality than standalone FAC training. Moreover, leveraging the salient properties of our discrete speech representations, we introduce SpeechRestorer, a postprocessing architecture designed to refine LLM-generated outputs. This module effectively mitigates stochastic errors prevalent in LLM inference pipelines while enhancing prosodic continuity, as validated by ablation experiments.
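The abstract does not spell out how CTC is wired into the codebook discretization, so the following is a minimal, hypothetical sketch of what that combination could look like: a vector-quantized content tokenizer whose quantized frames feed a CTC head trained against the transcript. The GRU encoder, module names, dimensions, and loss weight are illustrative assumptions rather than the authors' SpeechCodeVAE implementation, and the VAE's reconstruction decoder is omitted for brevity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCCodebookTokenizer(nn.Module):
    """Toy VQ tokenizer with an auxiliary CTC loss on the quantized codes.
    A sketch of the idea only; not the released SpeechCodeVAE."""

    def __init__(self, feat_dim=80, hidden=256, codebook_size=512, vocab_size=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, hidden)
        # CTC head: quantized frame -> phone/character logits plus blank (index 0).
        self.ctc_head = nn.Linear(hidden, vocab_size + 1)

    def quantize(self, z):
        # Nearest-codeword lookup with a straight-through gradient estimator.
        w = self.codebook.weight                               # (K, H)
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ w.t() + w.pow(2).sum(-1))            # (B, T, K)
        idx = dist.argmin(-1)                                  # (B, T)
        q = self.codebook(idx)                                 # (B, T, H)
        return z + (q - z).detach(), idx                       # straight-through

    def forward(self, mels, text, mel_lens, text_lens):
        z, _ = self.encoder(mels)                              # (B, T, H)
        q, idx = self.quantize(z)
        log_probs = self.ctc_head(q).log_softmax(-1)           # (B, T, V+1)
        # CTC ties the discrete code stream to the transcript, biasing codes
        # toward linguistic content with a monotonic, local time alignment.
        ctc = F.ctc_loss(log_probs.transpose(0, 1), text,
                         mel_lens, text_lens, blank=0, zero_infinity=True)
        # VQ commitment term (codebook/EMA updates and reconstruction omitted).
        commit = F.mse_loss(z, q.detach())
        return ctc + 0.25 * commit, idx

In this reading, the straight-through estimator lets gradients flow through the non-differentiable argmin, while the CTC term pushes each code to carry local phonetic content aligned monotonically with the text, which is one plausible source of the "locality" property the abstract describes.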
@article{zhuangfei2025_2507.01348,
  title={SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech},
  author={Cheng Zhuangfei and Zhang Guangyan and Tu Zehai and Song Yangyang and Mao Shuiyang and Jiao Xiaoqi and Li Jingyu and Guo Yiwen and Wu Jiasong},
  journal={arXiv preprint arXiv:2507.01348},
  year={2025}
}