Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

IEEE Transactions on Audio, Speech, and Language Processing (IEEE TASLP), 2023

5 January 2023

Papers citing "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers"

Gelina: Unified Speech and Gesture Synthesis via Interleaved Token Prediction

231

30 Mar 2026

M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

...

238

04 Dec 2025

Q2D2: A Geometry-Aware Audio Codec Leveraging Two-Dimensional Quantization

Tal Shuster

Eliya Nachmani

164

01 Dec 2025

Harmonic-Percussive Disentangled Neural Audio Codec for Bandwidth Extension

230

26 Nov 2025

Codec2Vec: Self-Supervised Speech Representation Learning Using Neural Speech Codecs

Wei-Cheng Tseng

David Harwath

SSL

416

20 Nov 2025

Multi-modal Deepfake Detection and Localization with FPN-Transformer

146

11 Nov 2025

SynTTS-Commands: A Public Dataset for On-Device KWS via TTS-Synthesized Multilingual Speech

Lu Gan

Xi Li

135

11 Nov 2025

Step-Audio-EditX Technical Report

...

214

05 Nov 2025

NaturalVoices: A Large-Scale, Spontaneous and Emotional Podcast Dataset for Voice Conversion

Zongyang Du

Shreeram Suresh Chandra

201

31 Oct 2025

Bayesian Speech Synthesizers Can Learn from Multiple Teachers

179

28 Oct 2025

MC-SJD: Maximal Coupling Speculative Jacobi Decoding for Autoregressive Visual Generation Acceleration

166

28 Oct 2025

SoulX-Podcast: Towards Realistic Long-form Podcasts with Dialectal and Paralinguistic Diversity

...

249

27 Oct 2025

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation

...

461

26 Oct 2025

U-Codec: Ultra Low Frame-rate Neural Speech Codec for Fast High-fidelity Speech Generation

171

19 Oct 2025

DiSTAR: Diffusion over a Scalable Token Autoregressive Representation for Speech Generation

...

253

14 Oct 2025

Improving Generative Behavior Cloning via Self-Guidance and Adaptive Chunking

193

14 Oct 2025

Universal Discrete-Domain Speech Enhancement

186

11 Oct 2025

DialoSpeech: Dual-Speaker Dialogue Generation with LLM and Flow Matching

...

479

09 Oct 2025

UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

452

06 Oct 2025

Speak, Edit, Repeat: High-Fidelity Voice Editing and Zero-Shot TTS with Cross-Attentive Mamba

Baher Mohammad

Magauiya Zhussip

Stamatios Lefkimmiatis

Mamba

204

06 Oct 2025

Beyond Static Knowledge Messengers: Towards Adaptive, Fair, and Scalable Federated Learning for Medical AI

287

05 Oct 2025

Soft Disentanglement in Frequency Bands for Neural Audio Codecs

161

04 Oct 2025

Désentrelacement Fréquentiel Doux pour les Codecs Audio Neuronaux

172

04 Oct 2025

Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Hieu-Nghia Huynh-Nguyen

Huynh Nguyen Dang

Ngoc Son Nguyen

Van Nguyen

138

03 Oct 2025

FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

259

01 Oct 2025

HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

153

30 Sep 2025

Word-Level Emotional Expression Control in Zero-Shot Text-to-Speech Synthesis

...

241

29 Sep 2025

VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning

...

223

29 Sep 2025

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

159

26 Sep 2025

AUV: Teaching Audio Universal Vector Quantization with Single Nested Codebook

219

26 Sep 2025

ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

179

26 Sep 2025

AUDDT: Audio Unified Deepfake Detection Benchmark Toolkit

148

25 Sep 2025

SPADE: Structured Pruning and Adaptive Distillation for Efficient LLM-TTS

207

25 Sep 2025

From Text to Talk: Audio-Language Model Needs Non-Autoregressive Joint Training

207

24 Sep 2025

Objective Evaluation of Prosody and Intelligibility in Speech Synthesis via Conditional Prediction of Discrete Tokens

190

24 Sep 2025

WEST: LLM based Speech Toolkit for Speech Understanding, Generation, and Interaction

...

515

24 Sep 2025

MaskVCT: Masked Voice Codec Transformer for Zero-Shot Voice Conversion With Increased Controllability via Multiple Guidances

Laureano Moro-Velazquez

Jesus Villalba

Najim Dehak

146

21 Sep 2025

MBCodec: Thorough disentangle for high-fidelity audio compression

243

21 Sep 2025

VoXtream: Full-Stream Text-to-Speech with Extremely Low Latency

246

19 Sep 2025

FocalCodec-Stream: Streaming Low-Bitrate Speech Coding via Causal Distillation

Luca Della Libera

Cem Subakan

Mirco Ravanelli

161

19 Sep 2025

Cross-Lingual F5-TTS: Towards Language-Agnostic Voice Cloning and Speech Synthesis

180

18 Sep 2025

DAIEN-TTS: Disentangled Audio Infilling for Environment-Aware Text-to-Speech Synthesis

231

18 Sep 2025

Neural Audio Codecs for Prompt-Driven Universal Sound Separation

Adhiraj Banerjee

Vipul Arora

VLM

287

15 Sep 2025

FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

A. K. M. Mahbubur Rahman

335

14 Sep 2025

Length-Aware Rotary Position Embedding for Text-Speech Alignment

128

14 Sep 2025

GmSLM: Generative Marmoset Spoken Language Modeling

238

11 Sep 2025

DeCodec: Rethinking Audio Codecs as Universal Disentangled Representation Learners

234

11 Sep 2025

EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for Speech-to-Speech LLMs

128

11 Sep 2025

Finite Scalar Quantization Enables Redundant and Transmission-Robust Neural Audio Compression at Low Bit-rates

235

11 Sep 2025

Streaming Sequence-to-Sequence Learning with Delayed Streams Modeling

319

10 Sep 2025