arXiv: 2205.04421

NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
9 May 2022
Xu Tan
Jiawei Chen
Haohe Liu
Jian Cong
Chen Zhang
Yanqing Liu
Xi Wang
Yichong Leng
Yuanhao Yi
Lei He
Frank Soong
Tao Qin
Sheng Zhao
Tie-Yan Liu

Papers citing "NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality"

50 / 142 papers shown
BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective
Andong Li
Tong Lei
Rilin Chen
Kai Li
Meng Yu
Xiaodong Li
Dong Yu
C. Zheng
DiffM
209
0
0
10 Nov 2025
NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion
Zongyang Du
Shreeram Suresh Chandra
Ismail Rasim Ulgen
Aurosweta Mahapatra
Ali N. Salman
Carlos Busso
Berrak Sisman
156
0
0
31 Oct 2025
U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation
Xusheng Yang
Long Zhou
Wenfu Wang
Kai Hu
Shulin Feng
Chenxing Li
Meng Yu
Dong Yu
Y. Zou
143
1
0
19 Oct 2025
Beyond Static Knowledge Messengers: Towards Adaptive, Fair, and Scalable Federated Learning for Medical AI
Jahidul Arafat
Fariha Tasmin
Sanjaya Poudel
Ahsan Habib Tareq
FedML
252
0
0
05 Oct 2025
Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis
Tianrui Wang
Haoyu Wang
Meng Ge
Cheng Gong
Chunyu Qiang
...
Xiaobao Wang
Eng Siong Chng
Xie Chen
Longbiao Wang
Jianwu Dang
194
0
0
29 Sep 2025
AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit
Yi Zhu
Heitor R. Guimarães
Arthur Pimentel
Tiago H. Falk
126
0
0
25 Sep 2025
TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu
Ziyue Zhang
Ban Ma-bao
Renzeng Duojie
Yuqing Cai
Yongbin Yu
Xiangxiang Wang
Fan Gao
Cheng Huang
Nyima Tashi
186
2
0
22 Sep 2025
Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching
Siratish Sakpiboonchit
129
0
0
10 Sep 2025
E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model
Ronghao Lin
Shuai Shen
Weipeng Hu
Qiaolin He
Aolin Xiong
Li Huang
Haifeng Hu
Y. Tan
130
0
0
18 Aug 2025
Next Tokens Denoising for Speech Synthesis
Yanqing Liu
Ruiqing Xue
C. Zhang
Yufei Liu
G. Wang
Bohan Li
Yao Qian
Lei He
Shujie Liu
Sheng Zhao
DiffM
204
2
0
30 Jul 2025
UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching
Neta Glazer
Aviv Navon
Yael Segal
Aviv Shamsian
Hilit Segev
Asaf Buchnick
Menachem Pirchi
Gil Hetz
Joseph Keshet
332
2
0
11 Jun 2025
Zero-Shot Text-to-Speech for Vietnamese
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Thi Vu
L. T. Nguyen
Dat Quoc Nguyen
217
2
0
02 Jun 2025
XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark
Ioan-Paul Ciobanu
Andrei Iulian Hiji
Nicolae-Cătălin Ristea
Paul Irofti
Cristian Rusu
Radu Tudor Ionescu
186
0
0
31 May 2025
Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing
Jeongsoo Choi
Jaehun Kim
Joon Son Chung
267
0
0
27 May 2025
CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning
Renyuan Li
Zhibo Liang
Haichuan Zhang
Tianyu Shi
Zhiyuan Cheng
Jia Shi
Carl Yang
Mingjie Tang
AAML
336
2
0
25 May 2025
FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation
Yutong Liu
Ziyue Zhang
Ban Ma-bao
Yuqing Cai
Yongbin Yu
Renzeng Duojie
Xiangxiang Wang
Fan Gao
Cheng Huang
Nyima Tashi
316
4
0
20 May 2025
DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis
IEEE Access, 2025
Zeeshan Ahmad
Shudi Bao
Meng Chen
232
2
0
14 May 2025
Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications
Speech Synthesis Workshop (SSW), 2023
Biel Tura Vecino
Adam Gabryś
Daniel Mątwicki
Andrzej Pomirski
Tom Iddon
Marius Cotescu
Jaime Lorenzo-Trueba
360
7
0
12 May 2025
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Bowen Zhang
Congchao Guo
Geng Yang
Hang Yu
Haozhe Zhang
...
Yichen Xiao
Yiying Zhou
Yujiao Shi
Yuan Lu
Yucen He
321
32
0
12 May 2025
AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection
Jianbo Gao
Keke Gai
Jing Yu
Liehuang Zhu
Qi Wu
AAML
302
2
0
28 Apr 2025
Protecting Your Voice: Temporal-aware Robust Watermarking
Yue Li
Weizhi Liu
Dongdong Lin
Hui Tian
Hongxia Wang
490
0
0
21 Apr 2025
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Yifan Yang
Shixuan Liu
Jiajian Li
Yuxuan Hu
Haibin Wu
...
Haiyang Sun
Yanqing Liu
Yan Lu
Kai Yu
Xie Chen
371
7
0
14 Apr 2025
"It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice Services
"It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice ServicesConference on Fairness, Accountability and Transparency (FAccT), 2025
Shira Michel
Sufi Kaur
Sarah Elizabeth Gillespie
Jeffrey Gleason
Christo Wilson
A. Ghosh
281
8
0
12 Apr 2025
Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities
Lele Cao
267
3
0
02 Apr 2025
LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images
Leyang Wang
Joice Lin
DiffM
278
0
0
20 Mar 2025
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Computer Vision and Pattern Recognition (CVPR), 2025
Zhedong Zhang
Liang-Sheng Li
C. Yan
Chunshan Liu
Anton Van Den Hengel
Yuankai Qi
348
5
0
15 Mar 2025
An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR
Sewade Ogun
Vincent Colotte
Emmanuel Vincent
336
1
0
11 Mar 2025
CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking
Yiming Li
Kaiying Yan
Shuo Shao
Tongqing Zhai
Shu-Tao Xia
Zhan Qin
D. Tao
AAML
856
3
0
02 Mar 2025
PodAgent: A Comprehensive Framework for Podcast Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yujia Xiao
Lei He
Haohan Guo
Fenglong Xie
Tan Lee
912
3
0
01 Mar 2025
Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding
Tianyun Liu
CLIP
VLM
324
2
0
26 Feb 2025
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yinghao Aaron Li
Rithesh Kumar
Zeyu Jin
DiffM
383
0
0
21 Feb 2025
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2025
Haorui He
Zengqiang Shang
Chaoren Wang
Xuyuan Li
Yicheng Gu
...
Peiyang Shi
Longji Xu
Kai Chen
Pengyuan Zhang
Zhikai Wu
AuLLM
385
20
0
27 Jan 2025
MathReader : Text-to-Speech for Mathematical Documents
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Sieun Hyeon
Kyudan Jung
N. Kim
Hyun Gon Ryu
Jaeyoung Do
321
5
0
13 Jan 2025
DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025
Weidong Chen
Shan Yang
Guangzhi Li
Xixin Wu
DiffM
122
9
0
08 Jan 2025
CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024
Ji-Hoon Kim
Hong-Sun Yang
Yoon-Cheol Ju
Il-Hwan Kim
Byeong-Yeol Kim
Joon Son Chung
BDL
357
1
0
31 Dec 2024
Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
Ze Yuan
Yanqing Liu
Shujie Liu
Sheng Zhao
AuLLM
299
7
0
06 Dec 2024
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Haohe Liu
Gaël Le Lan
Xinhao Mei
Zhaoheng Ni
Anurag Kumar
Varun K. Nagaraja
Wenwu Wang
Mark D. Plumbley
Yangyang Shi
Vikas Chandra
VGen
372
14
0
03 Dec 2024
Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook
Florinel-Alin Croitoru
Andrei Iulian Hiji
Vlad Hondru
Nicolae-Cătălin Ristea
Paul Irofti
Marius Popescu
Cristian Rusu
Radu Tudor Ionescu
Fahad Shahbaz Khan
Mubarak Shah
436
19
0
29 Nov 2024
Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis
International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2024
Suparna De
Ionut Bostan
Nishanth Sastry
285
0
0
24 Oct 2024
Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Onkar Kishor Susladkar
Vishesh Tripathi
Biddwan Ahmed
141
0
0
09 Oct 2024
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Emmanouil Benetos
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen
632
303
0
09 Oct 2024
Zero-Shot Text-to-Speech from Continuous Text Streams
Trung D. Q. Dang
David Aponte
Dung Tran
Tianyi Chen
K. Koishida
AuLLM
VLM
178
9
0
01 Oct 2024
Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Youngjae Kim
Yejin Jeon
Gary Geunbae Lee
274
1
0
27 Sep 2024
Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Ryuichi Yamamoto
Yuma Shirahata
Masaya Kawamura
Kentaro Tachibana
DiffM
238
4
0
26 Sep 2024
FastTalker: Jointly Generating Speech and Conversational Gestures from Text
Zixin Guo
Jian Zhang
404
4
0
24 Sep 2024
Speechworthy Instruction-tuned Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Hyundong Justin Cho
Nicolaas Jedema
Leonardo F. R. Ribeiro
Karishma Sharma
Pedro Szekely
Alessandro Moschitti
Ruben Janssen
Jonathan May
ALM
237
4
0
23 Sep 2024
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Sijing Chen
Qi Liu
Laipeng He
Tianwei He
Wendi He
...
Huimin Zhang
Xiang Zhang
Guangcheng Zhao
Hongbin Zhou
Pengpeng Zou
310
12
0
18 Sep 2024
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Xin Qi
Ruibo Fu
Zhengqi Wen
Tao Wang
Chunyu Qiang
...
Xiaopeng Wang
Yuankun Xie
Yukun Liu
Zhengqi Wen
Guanjun Li
DiffM
326
1
0
18 Sep 2024
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Yinghao Aaron Li
Xilin Jiang
Cong Han
N. Mesgarani
DiffM
297
10
0
16 Sep 2024
Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling
Spoken Language Technology Workshop (SLT), 2024
Sotirios Karapiperis
Nikolaos Ellinas
Alexandra Vioni
Junkwang Oh
Gunu Jho
Inchul Hwang
S. Raptis
312
3
0
13 Sep 2024