Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

5 January 2023
Chengyi Wang
Sanyuan Chen
Yu-Huan Wu
Zi-Hua Zhang
Long Zhou
Shujie Liu
Zhuo Chen
Yanqing Liu
Huaming Wang
Jinyu Li
Lei He
Sheng Zhao
Furu Wei

Papers citing "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers"

50 / 463 papers shown
Title
TacoLM: GaTed Attention Equipped Codec Language Model are Efficient Zero-Shot Text to Speech Synthesizers
Yakun Song
Zhuo Chen
Xiaofei Wang
Ziyang Ma
Guanrou Yang
Xie Chen
AuLLM
27
3
0
22 Jun 2024
GLOBE: A High-quality English Corpus with Global Accents for Zero-shot Speaker Adaptive Text-to-Speech
Wenbin Wang
Yang Song
Sanjay Jha
29
5
0
21 Jun 2024
DASB -- Discrete Audio and Speech Benchmark
Pooneh Mousavi
Luca Della Libera
J. Duret
Artem Ploujnikov
Cem Subakan
Mirco Ravanelli
35
12
0
20 Jun 2024
Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction
Haoqiu Yan
Yongxin Zhu
Kai Zheng
Bing Liu
Haoyu Cao
Deqiang Jiang
Linli Xu
AuLLM
29
4
0
18 Jun 2024
A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
Guoqiang Hu
Huaning Tan
Ruilai Li
13
2
0
18 Jun 2024
1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Sewade Ogun
A. Owodunni
Tobi Olatunji
Eniola Alese
Babatunde Oladimeji
Tejumade Afonja
Kayode Olaleye
Naome A. Etori
Tosin P. Adewumi
25
4
0
17 Jun 2024
How Should We Extract Discrete Audio Tokens from Self-Supervised Models?
Pooneh Mousavi
J. Duret
Salah Zaiem
Luca Della Libera
Artem Ploujnikov
Cem Subakan
Mirco Ravanelli
34
9
0
15 Jun 2024
MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model
Jiatong Shi
Xutai Ma
Hirofumi Inaguma
Anna Y. Sun
Shinji Watanabe
50
7
0
14 Jun 2024
SingOMD: Singing Oriented Multi-resolution Discrete Representation Construction from Speech Models
Yuxun Tang
Yuning Wu
Jiatong Shi
Qin Jin
52
5
0
13 Jun 2024
CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Xueyuan Chen
Dongchao Yang
Dingdong Wang
Xixin Wu
Zhiyong Wu
Helen Meng
33
1
0
12 Jun 2024
LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation
Wenhao Guan
K. Wang
Wangjin Zhou
Yang Wang
Feng Deng
Hui Wang
Lin Li
Q. Hong
Yong Qin
DiffM
28
3
0
12 Jun 2024
Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio
Yi Lu
Yuankun Xie
Ruibo Fu
Zhengqi Wen
Jianhua Tao
...
Xuefei Liu
Yongwei Li
Yukun Liu
Xiaopeng Wang
Shuchen Shi
27
1
0
12 Jun 2024
VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech
Ashishkumar Gudmalwar
Nirmesh Shah
Sai Akarsh
Pankaj Wasnik
R. Shah
19
1
0
12 Jun 2024
VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
Bing Han
Long Zhou
Shujie Liu
Sanyuan Chen
Lingwei Meng
Yanming Qian
Yanqing Liu
Sheng Zhao
Jinyu Li
Furu Wei
33
13
0
12 Jun 2024
The Interspeech 2024 Challenge on Speech Processing Using Discrete Units
Xuankai Chang
Jiatong Shi
Jinchuan Tian
Yuning Wu
Yuxun Tang
Yihan Wu
Shinji Watanabe
Yossi Adi
Xie Chen
Qin Jin
43
15
0
11 Jun 2024
Just Because We Camp, Doesn't Mean We Should: The Ethics of Modelling Queer Voices
A. Sigurgeirsson
Eddie L. Ungless
31
2
0
11 Jun 2024
CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems
Haibin Wu
Yuan Tseng
Hung-yi Lee
AuLLM
17
6
0
11 Jun 2024
Controlling Emotion in Text-to-Speech with Natural Language Prompts
Thomas Bott
Florian Lux
Ngoc Thang Vu
28
6
0
10 Jun 2024
Learning Fine-Grained Controllability on Speech Generation via Efficient Fine-Tuning
Chung-Ming Chien
Andros Tjandra
Apoorv Vyas
Matt Le
Bowen Shi
Wei-Ning Hsu
32
0
0
10 Jun 2024
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS
Xiaofei Wang
Sefik Emre Eskimez
Manthan Thakker
Hemin Yang
Zirun Zhu
...
Yufei Xia
Jinzhu Li
Sheng Zhao
Jinyu Li
Naoyuki Kanda
27
3
0
09 Jun 2024
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling
Yuepeng Jiang
Tao Li
Fengyu Yang
Lei Xie
Meng Meng
Yujun Wang
25
2
0
09 Jun 2024
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
Zhijun Liu
Shuai Wang
Sho Inoue
Qibing Bai
Haizhou Li
DiffM
32
15
0
08 Jun 2024
Exploring the Benefits of Tokenization of Discrete Acoustic Units
Avihu Dekel
Raul Fernandez
30
2
0
08 Jun 2024
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei
VLM
29
64
0
08 Jun 2024
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Edresson Casanova
Kelly Davis
Eren Golge
Görkem Göknar
Iulian Gulea
...
Aya Aljafari
Joshua Meyer
Reuben Morais
Samuel Olayemi
Julian Weber
VLM
32
65
0
07 Jun 2024
PPPR: Portable Plug-in Prompt Refiner for Text to Audio Generation
Shuchen Shi
Ruibo Fu
Zhengqi Wen
Jianhua Tao
Tao Wang
...
Xuefei Liu
Yukun Liu
Yongwei Li
Zhiyong Wang
Xiaopeng Wang
18
1
0
07 Jun 2024
Small-E: Small Language Model with Linear Attention for Efficient Speech Synthesis
Théodor Lemerle
Nicolas Obin
Axel Roebel
29
6
0
06 Jun 2024
Retrieval Augmented Generation in Prompt-based Text-to-Speech Synthesis with Context-Aware Contrastive Language-Audio Pretraining
Jinlong Xue
Yayue Deng
Yingming Gao
Ya Li
RALM
VLM
28
4
0
06 Jun 2024
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model
Jinlong Xue
Yayue Deng
Yicheng Han
Yingming Gao
Ya Li
40
4
0
06 Jun 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
Y. Guo
VGen
97
16
0
06 Jun 2024
Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy
Yuankun Xie
Ruibo Fu
Zhengqi Wen
Zhiyong Wang
Xiaopeng Wang
Haonnan Cheng
Long Ye
Jianhua Tao
21
2
0
05 Jun 2024
Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder
Haohan Guo
Fenglong Xie
Dongchao Yang
Hui Lu
Xixin Wu
Helen Meng
48
6
0
05 Jun 2024
Task Arithmetic can Mitigate Synthetic-to-Real Gap in Automatic Speech Recognition
Hsuan Su
Hua Farn
Fan-Yun Sun
Shang-Tse Chen
Hung-yi Lee
MoMe
24
2
0
05 Jun 2024
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Trung D. Q. Dang
David Aponte
Dung Tran
K. Koishida
34
3
0
05 Jun 2024
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Philip Anastassiou
Jiawei Chen
J. Chen
Yuanzhe Chen
Zhuo Chen
...
Wenjie Zhang
Y. Zhang
Zilin Zhao
Dejian Zhong
Xiaobin Zhuang
41
74
0
04 Jun 2024
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
Ruiqi Li
Rongjie Huang
Yongqi Wang
Zhiqing Hong
Zhou Zhao
29
1
0
04 Jun 2024
SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models
Dongchao Yang
Dingdong Wang
Haohan Guo
Xueyuan Chen
Xixin Wu
Helen M. Meng
57
24
0
04 Jun 2024
MaskSR: Masked Language Model for Full-band Speech Restoration
Xu Li
Qirui Wang
Xiaoyu Liu
30
8
0
04 Jun 2024
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis
Kun Zhou
Shengkui Zhao
Yukun Ma
Chong Zhang
Hao Wang
Dianwen Ng
Chongjia Ni
Nguyen Trung Hieu
J. Yip
Bin Ma
22
5
0
04 Jun 2024
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
Shengpeng Ji
Jia-li Zuo
Minghui Fang
Siqi Zheng
Qian Chen
...
Ziyue Jiang
Hai Huang
Xize Cheng
Rongjie Huang
Zhou Zhao
45
7
0
03 Jun 2024
Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer
Yongxin Zhu
Dan Su
Liqiang He
Linli Xu
Dong Yu
31
5
0
03 Jun 2024
Enhancing Zero-shot Text-to-Speech Synthesis with Human Feedback
Chen Chen
Yuchen Hu
Wen Wu
Helin Wang
Chng Eng Siong
Chao Zhang
33
10
0
02 Jun 2024
Wav2Prompt: End-to-End Speech Prompt Generation and Tuning For LLM in Zero and Few-shot Learning
Keqi Deng
Guangzhi Sun
Phil Woodland
VLM
28
4
0
01 Jun 2024
A Survey of Deep Learning Audio Generation Methods
Matej Bozic
Marko Horvat
VLM
MedIm
39
0
0
31 May 2024
SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
Hongyu Gong
Bandhav Veluri
38
0
0
30 May 2024
TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation
Chenyang Le
Yao Qian
Dongmei Wang
Long Zhou
Shujie Liu
...
Midia Yousefi
Yanmin Qian
Jinyu Li
Sheng Zhao
Michael Zeng
34
3
0
28 May 2024
C3LLM: Conditional Multimodal Content Generation Using Large Language Models
Zixuan Wang
Qinkai Duan
Yu-Wing Tai
Chi-Keung Tang
27
3
0
25 May 2024
DAC-JAX: A JAX Implementation of the Descript Audio Codec
David Braun
21
0
0
19 May 2024
Evaluating Text-to-Speech Synthesis from a Large Discrete Token-based Speech Language Model
Siyang Wang
Éva Székely
36
4
0
16 May 2024
PolyGlotFake: A Novel Multilingual and Multimodal DeepFake Dataset
Yang Hou
Haitao Fu
Chuankai Chen
Zida Li
Haoyu Zhang
Jianjun Zhao
24
3
0
14 May 2024