MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

International Conference on Learning Representations (ICLR), 2025
1 September 2024
Yuancheng Wang
Haoyue Zhan
Liwei Liu
Ruihong Zeng
Haotian Guo
Jiachen Zheng
Qiang Zhang
Xueyao Zhang
Shunsi Zhang
Zhizheng Wu

Papers citing "MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer"

50 / 60 papers shown
Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale
Yicheng Zhong
Peiji Yang
Zhisheng Wang
105
0
0
26 Nov 2025
AlignSurvey: A Comprehensive Benchmark for Human Preferences Alignment in Social Surveys
Chenxi Lin
Weikang Yuan
Zhuoren Jiang
Biao Huang
Ruitao Zhang
Jianan Ge
Yueqian Xu
Jianxing Yu
ALM
557
0
0
11 Nov 2025
Step-Audio-EditX Technical Report
Chao Yan
Boyong Wu
Peng Yang
Pengfei Tan
Guoqiang Hu
...
Xiangyu Zhang
Daxin Jiang
Shuchang Zhou
Gang Yu
128
1
0
05 Nov 2025
Bayesian Speech synthesizers Can Learn from Multiple Teachers
Ziyang Zhang
Yifan Gao
Xuenan Xu
Baoxiang Li
Wen Wu
Chao Zhang
76
0
0
28 Oct 2025
SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity
Hanke Xie
Haopeng Lin
Wenxiao Cao
Dake Guo
WenJie Tian
...
Shunshun Yin
Ming Tao
Xie Chen
Lei Xie
Xinsheng Wang
145
1
0
27 Oct 2025
Label Smoothing Improves Gradient Ascent in LLM Unlearning
Zirui Pang
Hao Zheng
Zhijie Deng
Ling Li
Zixin Zhong
Jiaheng Wei
MU
167
0
0
25 Oct 2025
Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator
H. Wang
Na Li
Chuke Wang
Shu Wu
Zhifeng Li
Dong Yu
DiffM
120
0
0
23 Oct 2025
EchoFake: A Replay-Aware Dataset for Practical Speech Deepfake Detection
Tong Zhang
Yihuan Huang
Yanzhen Ren
76
0
0
22 Oct 2025
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
Hui Wang
J. Zhao
Yifan Yang
Shujie Liu
Junyang Chen
...
Jinyu Li
Jiaming Zhou
Haoqin Sun
Yan Lu
Yong Qin
AuLLM ELM
194
1
0
16 Oct 2025
InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue
Wenwen Tong
Hewei Guo
Dongchuan Ran
Jiangnan Chen
Jiefan Lu
...
Dinghao Zhou
Guiping Zhong
Ken Zheng
Shiyin Kang
Lewei Lu
MLLM AuLLM VGen VLM
400
4
0
15 Oct 2025
DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation
Yakun Song
Xiaobin Zhuang
Jiawei Chen
Zhikang Niu
Guanrou Yang
...
Zhuo Chen
Yuping Wang
Xie Chen
DiffM
164
0
0
14 Oct 2025
TokenChain: A Discrete Speech Chain via Semantic Token Modeling
Mingxuan Wang
Satoshi Nakamura
88
0
0
07 Oct 2025
Data-efficient Targeted Token-level Preference Optimization for LLM-based Text-to-Speech
Rikuto Kotoge
Yuichi Sasaki
84
0
0
07 Oct 2025
UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models
Wenhao Guan
Zhikang Niu
Ziyue Jiang
Kaidi Wang
Peijie Chen
Q. Hong
Lin Li
Xie Chen
AuLLM
257
0
0
06 Oct 2025
Beyond Static Knowledge Messengers: Towards Adaptive, Fair, and Scalable Federated Learning for Medical AI
Jahidul Arafat
Fariha Tasmin
Sanjaya Poudel
Ahsan Habib Tareq
FedML
195
0
0
05 Oct 2025
FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates
Jiaqi Li
Y. Qian
Yuxuan Hu
Leying Zhang
Xiaofei Wang
Heng Lu
Manthan Thakker
Jinyu Li
Sheng Zhao
Zhizheng Wu
194
1
0
01 Oct 2025
VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning
Yixuan Zhou
Guoyang Zeng
Xin Liu
Xiang Li
Renjie Yu
...
Weiyue Sun
Jiancheng Gui
Kehan Li
Z. Wu
Zhiyuan Liu
117
1
0
29 Sep 2025
VoiceBridge: Designing Latent Bridge Models for General Speech Restoration at Scale
Chi Zhang
Zehua Chen
Kaiwen Zheng
Jun Zhu
AuLLM
158
0
0
28 Sep 2025
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Junjie Cao
Yichen Han
Ruonan Zhang
Xiaoyang Hao
Hongxiang Li
Shuaijiang Zhao
Yue Liu
Xiao-Ping Zhang
99
0
0
26 Sep 2025
Audio Super-Resolution with Latent Bridge Models
Chang Li
Zehua Chen
Liyuan Wang
Jun Zhu
312
3
0
22 Sep 2025
MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances
Junhyeok Lee
Helin Wang
Yaohan Guan
Thomas Thebaud
Laureano Moro-Velazquez
Jesus Villalba
Najim Dehak
84
0
0
21 Sep 2025
VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency
Nikita Torgashov
Gustav Eje Henter
Gabriel Skantze
VLM
128
0
0
19 Sep 2025
Beyond Video-to-SFX: Video to Audio Synthesis with Environmentally Aware Speech
Xinlei Niu
Jianbo Ma
Dylan Harper-Harris
Xiangyu Zhang
Charles Patrick Martin
Jing Zhang
DiffM VGen
76
0
0
19 Sep 2025
DiTReducio: A Training-Free Acceleration for DiT-Based TTS via Progressive Calibration
Yanru Huo
Ziyue Jiang
Zuoli Tang
Q. Hong
Zhou Zhao
128
1
0
11 Sep 2025
Spectral Masking and Interpolation Attack (SMIA): A Black-box Adversarial Attack against Voice Authentication and Anti-Spoofing Systems
Kamel Kamel
Hridoy Sankar Dutta
Keshav Sood
Sunil Aryal
AAML
100
0
0
09 Sep 2025
LibriQuote: A Speech Dataset of Fictional Character Utterances for Expressive Zero-Shot Speech Synthesis
Gaspard Michel
Elena V. Epure
Christophe Cerisara
104
0
0
04 Sep 2025
FireRedTTS-2: Towards Long Conversational Speech Generation for Podcast and Chatbot
Kun Xie
Feiyu Shen
Junjie Li
Fenglong Xie
Xu Tang
Yao Hu
142
9
0
02 Sep 2025
CodecBench: A Comprehensive Benchmark for Acoustic and Semantic Evaluation
Ruifan Deng
Yitian Gong
Qinghui Gao
Luozhijie Jin
Qinyuan Cheng
Zhaoye Fei
Shimin Li
Xipeng Qiu
AuLLM
113
2
0
28 Aug 2025
Multi-Metric Preference Alignment for Generative Speech Restoration
Junan Zhang
Xueyao Zhang
Jing Yang
Yuancheng Wang
Fan Fan
Zhizheng Wu
148
5
0
24 Aug 2025
Vevo2: A Unified and Controllable Framework for Speech and Singing Voice Generation
Xueyao Zhang
Junan Zhang
Yuancheng Wang
Chaoren Wang
Yuanzhe Chen
Dongya Jia
Zhuo Chen
Zhizheng Wu
DiffM
221
6
0
22 Aug 2025
E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
Ronghao Lin
Shuai Shen
Weipeng Hu
Qiaolin He
Aolin Xiong
Li Huang
Haifeng Hu
Y. Tan
98
0
0
18 Aug 2025
Parallel GPT: Harmonizing the Independence and Interdependence of Acoustic and Semantic Information for Zero-Shot Text-to-Speech
Jingyuan Xing
Zhipeng Li
Jialong Mai
Xiaofen Xing
Xiangmin Xu
184
0
0
06 Aug 2025
SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents
C. Jiang
Jiajun Sun
Yifei Cao
Jiabao Zhuang
Hui Li
Xiaoran Fan
Ming-bo Wen
Junjie Ye
251
0
0
04 Aug 2025
Adaptive Duration Model for Text Speech Alignment
Junjie Cao
104
0
0
30 Jul 2025
Step-Audio 2 Technical Report
Boyong Wu
Chao Yan
Chen Hu
Cheng Yi
Chengli Feng
...
Yuanwei Lu
Yuchu Luo
Yuhe Yin
Yumeng Zhan
Y. Zhang
AuLLM
235
0
0
22 Jul 2025
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech
Siyi Zhou
Yiquan Zhou
Yi He
Xun Zhou
Jinchao Wang
Wei Deng
Jingchen Shu
DiffM
163
14
0
23 Jun 2025
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
Kexin Huang
Qian Tu
Liwei Fan
Chenchen Yang
Dong Zhang
Shimin Li
Zhaoye Fei
Qinyuan Cheng
Xipeng Qiu
199
5
0
19 Jun 2025
ZipVoice: Fast and High-Quality Zero-Shot Text-to-Speech with Flow Matching
Han Zhu
Wei Kang
Zengwei Yao
Liyong Guo
Fangjun Kuang
Zhaoqing Li
Weiji Zhuang
Long Lin
Daniel Povey
311
11
0
16 Jun 2025
StreamMel: Real-Time Zero-shot Text-to-Speech via Interleaved Continuous Autoregressive Modeling
IEEE Signal Processing Letters (IEEE SPL), 2025
Hui Wang
Yifan Yang
Shujie Liu
Jinyu Li
Lingwei Meng
Y. Liu
Jiaming Zhou
Haoqin Sun
Yan Lu
Yong Qin
184
3
0
14 Jun 2025
Towards Generalized Source Tracing for Codec-Based Deepfake Speech
Xuanjun Chen
I-Ming Lin
Lin Zhang
Haibin Wu
Hung-yi Lee
J. Jang
332
1
0
08 Jun 2025
Discl-VC: Disentangled Discrete Tokens and In-Context Learning for Controllable Zero-Shot Voice Conversion
Kaidi Wang
Wenhao Guan
Ziyue Jiang
Hukai Huang
Peijie Chen
Weijie Wu
Q. Hong
Lin Li
178
3
0
30 May 2025
VoiceMark: Zero-Shot Voice Cloning-Resistant Watermarking Approach Leveraging Speaker-Specific Latents
Haiyun Li
Zhiyong Wu
Xiaofeng Xie
Jingran Xie
Yaoxun Xu
Hanyang Peng
280
1
0
27 May 2025
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
Puyuan Peng
Shang-Wen Li
Abdelrahman Mohamed
David Harwath
183
0
0
26 May 2025
Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Qixi Zheng
Emmanouil Benetos
Zhikang Niu
Ziyang Ma
Xiaofei Wang
Kai Yu
Xie Chen
263
2
0
26 May 2025
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Zhihao Du
Changfeng Gao
Yuxuan Wang
Fan Yu
Tianyu Zhao
...
Mengzhe Chen
Yafeng Chen
Shiliang Zhang
Wen Wang
Jieping Ye
AuLLM
294
51
0
23 May 2025
DualCodec: A Low-Frame-Rate, Semantically-Enhanced Neural Audio Codec for Speech Generation
Jiaqi Li
Xiaolong Lin
Zhekai Li
Shixi Huang
Yuancheng Wang
Chaoren Wang
Zhenpeng Zhan
Zhizheng Wu
381
11
0
19 May 2025
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space
Zhengrui Ma
Yang Feng
Chenze Shao
Fandong Meng
Jie Zhou
Min Zhang
256
3
0
19 May 2025
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Bowen Zhang
Congchao Guo
Geng Yang
Hang Yu
Haozhe Zhang
...
Yichen Xiao
Yiying Zhou
Yujiao Shi
Yuan Lu
Yucen He
247
22
0
12 May 2025
Advancing Zero-shot Text-to-Speech Intelligibility across Diverse Domains via Preference Alignment
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Xueyao Zhang
Yijiao Wang
Chaoren Wang
Hui Yuan
Zhuo Chen
Zhizheng Wu
648
10
0
07 May 2025
SepALM: Audio Language Models Are Error Correctors for Robust Speech Separation
International Joint Conference on Artificial Intelligence (IJCAI), 2025
Zhaoxi Mu
Xinyu Yang
Gang Wang
AuLLM KELM VLM
402
1
0
06 May 2025