NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

9 May 2022

Xu Tan

Jian Cong

Papers citing "NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality"

50 / 142 papers shown

BridgeVoC: Revitalizing Neural Vocoder from a Restoration Perspective

209

10 Nov 2025

NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

Zongyang Du

Shreeram Suresh Chandra

156

31 Oct 2025

U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

143

19 Oct 2025

Beyond Static Knowledge Messengers: Towards Adaptive, Fair, and Scalable Federated Learning for Medical AI

252

05 Oct 2025

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

...

194

29 Sep 2025

AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit

126

25 Sep 2025

TMD-TTS: A Unified Tibetan Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

186

22 Sep 2025

Accelerating Diffusion Transformer-Based Text-to-Speech with Transformer Layer Caching

Siratish Sakpiboonchit

129

10 Sep 2025

E3RG: Building Explicit Emotion-driven Empathetic Response Generation System with Multimodal Large Language Model

130

18 Aug 2025

Next Tokens Denoising for Speech Synthesis

204

30 Jul 2025

UmbraTTS: Adapting Text-to-Speech to Environmental Contexts with Flow Matching

332

11 Jun 2025

Zero-Shot Text-to-Speech for Vietnamese

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Thi Vu

L. T. Nguyen

Dat Quoc Nguyen

217

02 Jun 2025

XMAD-Bench: Cross-Domain Multilingual Audio Deepfake Benchmark

Ioan-Paul Ciobanu

Andrei Iulian Hiji

Nicolae-Cătălin Ristea

Paul Irofti

Cristian Rusu

Radu Tudor Ionescu

186

31 May 2025

Dub-S2ST: Textless Speech-to-Speech Translation for Seamless Dubbing

Jeongsoo Choi

Jaehun Kim

Joon Son Chung

267

27 May 2025

CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning

336

25 May 2025

FMSD-TTS: Few-shot Multi-Speaker Multi-Dialect Text-to-Speech Synthesis for Ü-Tsang, Amdo and Kham Speech Dataset Generation

316

20 May 2025

DPN-GAN: Inducing Periodic Activations in Generative Adversarial Networks for High-Fidelity Audio Synthesis

IEEE Access (IEEE Access), 2025

Zeeshan Ahmad

Shudi Bao

Meng Chen

232

14 May 2025

Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications

Speech Synthesis Workshop (SSW), 2023

360

12 May 2025

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

...

321

12 May 2025

AGATE: Stealthy Black-box Watermarking for Multimodal Model Copyright Protection

302

28 Apr 2025

Protecting Your Voice: Temporal-aware Robust Watermarking

490

21 Apr 2025

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

...

371

14 Apr 2025

"It's not a representation of me": Examining Accent Bias and Digital Exclusion in Synthetic AI Voice ServicesConference on Fairness, Accountability and Transparency (FAccT), 2025

Shira Michel

Sufi Kaur

Sarah Elizabeth Gillespie

Jeffrey Gleason

Christo Wilson

A. Ghosh

281

12 Apr 2025

Watermarking for AI Content Detection: A Review on Text, Visual, and Audio Modalities

Lele Cao

267

02 Apr 2025

LaPIG: Cross-Modal Generation of Paired Thermal and Visible Facial Images

Leyang Wang

Joice Lin

DiffM

278

20 Mar 2025

Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing

Computer Vision and Pattern Recognition (CVPR), 2025

348

15 Mar 2025

An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR

Sewade Ogun

Vincent Colotte

Emmanuel Vincent

336

11 Mar 2025

CBW: Towards Dataset Ownership Verification for Speaker Verification via Clustering-based Backdoor Watermarking

856

02 Mar 2025

PodAgent: A Comprehensive Framework for Podcast Generation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

912

01 Mar 2025

Clip-TTS: Contrastive Text-content and Mel-spectrogram, A High-Quality Text-to-Speech Method based on Contextual Semantic Understanding

Tianyun Liu

CLIP VLM

324

26 Feb 2025

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

383

21 Feb 2025

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2025

...

385

27 Jan 2025

MathReader : Text-to-Speech for Mathematical Documents

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

321

13 Jan 2025

DrawSpeech: Expressive Speech Synthesis Using Prosodic Sketches as Control Conditions

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2025

122

08 Jan 2025

CrossSpeech++: Cross-lingual Speech Synthesis with Decoupled Language and Speaker Generation

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024

357

31 Dec 2024

Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners

299

06 Dec 2024

SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text

372

03 Dec 2024

Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook

Florinel-Alin Croitoru

Andrei Iulian Hiji

Vlad Hondru

Nicolae-Cătălin Ristea

436

29 Nov 2024

Making Social Platforms Accessible: Emotion-Aware Speech Generation with Integrated Text Analysis

International Conference on Advances in Social Networks Analysis and Mining (ASONAM), 2024

Suparna De

Ionut Bostan

Nishanth Sastry

285

24 Oct 2024

Bahasa Harmony: A Comprehensive Dataset for Bahasa Text-to-Speech Synthesis with Discrete Codec Modeling of EnGen-TTS

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Onkar Kishor Susladkar

Vishesh Tripathi

Biddwan Ahmed

141

09 Oct 2024

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

632

303

09 Oct 2024

Zero-Shot Text-to-Speech from Continuous Text Streams

Trung D. Q. Dang

David Aponte

Dung Tran

Tianyi Chen

K. Koishida

AuLLM VLM

178

01 Oct 2024

Audio-Based Linguistic Feature Extraction for Enhancing Multi-lingual and Low-Resource Text-to-Speech

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Youngjae Kim

Yejin Jeon

Gary Geunbae Lee

274

27 Sep 2024

Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

238

26 Sep 2024

FastTalker: Jointly Generating Speech and Conversational Gestures from Text

Zixin Guo

Jian Zhang

404

24 Sep 2024

Speechworthy Instruction-tuned Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Hyundong Justin Cho

Nicolaas Jedema

Leonardo F. R. Ribeiro

237

23 Sep 2024

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Sijing Chen

Laipeng He

...

Xiang Zhang

310

18 Sep 2024

DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

Tao Wang

...

Xiaopeng Wang

Yuankun Xie

Yukun Liu

Zhengqi Wen

Guanjun Li

DiffM

326

18 Sep 2024

StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

297

16 Sep 2024

Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling

Spoken Language Technology Workshop (SLT), 2024

312

13 Sep 2024