ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2301.02111
  4. Cited By
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

5 January 2023
Chengyi Wang
Sanyuan Chen
Yu-Huan Wu
Zi-Hua Zhang
Long Zhou
Shujie Liu
Zhuo Chen
Yanqing Liu
Huaming Wang
Jinyu Li
Lei He
Sheng Zhao
Furu Wei
ArXivPDFHTML

Papers citing "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers"

50 / 463 papers shown
Title
Optimizing Neural Speech Codec for Low-Bitrate Compression via
  Multi-Scale Encoding
Optimizing Neural Speech Codec for Low-Bitrate Compression via Multi-Scale Encoding
Peiji Yang
Fengping Wang
Yicheng Zhong
Huawei Wei
Zhisheng Wang
18
0
0
21 Oct 2024
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
AuLLM
VLM
59
3
0
20 Oct 2024
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction
  and Speculative Decoding
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
Tan Dat Nguyen
Ji-Hoon Kim
Jeongsoo Choi
Shukjae Choi
Jinseok Park
Younglo Lee
Joon Son Chung
26
0
0
17 Oct 2024
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Xuyuan Li
Zengqiang Shang
Hua Hua
Peiyang Shi
Chen Yang
Li Wang
Pengyuan Zhang
30
2
0
16 Oct 2024
JOOCI: a Framework for Learning Comprehensive Speech Representations
JOOCI: a Framework for Learning Comprehensive Speech Representations
Hemant Yadav
R. Shah
Sunayana Sitaram
16
0
0
14 Oct 2024
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow
  Matching
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen
25
50
0
09 Oct 2024
Can DeepFake Speech be Reliably Detected?
Can DeepFake Speech be Reliably Detected?
Hongbin Liu
Youzheng Chen
Arun Narayanan
Athula Balachandran
Pedro J. Moreno
Lun Wang
AAML
14
1
0
09 Oct 2024
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice
  Interaction Abilities
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Xin Zhang
Xiang Lyu
Zhihao Du
Qian Chen
Dong Zhang
...
Yuxuan Wang
Bin Zhang
Heng Lu
Yaqian Zhou
Xipeng Qiu
AuLLM
28
5
0
09 Oct 2024
Sylber: Syllabic Embedding Representation of Speech from Raw Audio
Sylber: Syllabic Embedding Representation of Speech from Raw Audio
Cheol Jun Cho
Nicholas Lee
Akshat Gupta
Dhruv Agarwal
Ethan Chen
Alan W Black
Gopala K. Anumanchipalli
32
0
0
09 Oct 2024
CTC-GMM: CTC guided modality matching for fast and accurate streaming
  speech translation
CTC-GMM: CTC guided modality matching for fast and accurate streaming speech translation
Rui Zhao
Jinyu Li
Ruchao Fan
Matt Post
26
1
0
07 Oct 2024
SegINR: Segment-wise Implicit Neural Representation for Sequence
  Alignment in Neural Text-to-Speech
SegINR: Segment-wise Implicit Neural Representation for Sequence Alignment in Neural Text-to-Speech
Minchan Kim
Myeonghun Jeong
Joun Yeop Lee
Nam Soo Kim
16
0
0
07 Oct 2024
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long
  Zero-Shot Text-to-Speech Synthesis
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Yuto Nishimura
Takumi Hirose
Masanari Ohi
Hideki Nakayama
Nakamasa Inoue
VLM
29
1
0
06 Oct 2024
Graded Suspiciousness of Adversarial Texts to Human
Graded Suspiciousness of Adversarial Texts to Human
Shakila Mahjabin Tonni
Pedro Faustini
Mark Dras
AAML
16
0
0
06 Oct 2024
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition
Zixuan Wang
Chi-Keung Tang
Chi-Keung Tang
DiffM
VGen
LLMAG
41
4
0
04 Oct 2024
Augmentation through Laundering Attacks for Audio Spoof Detection
Augmentation through Laundering Attacks for Audio Spoof Detection
Hashim Ali
Surya Subramani
Hafiz Malik
16
0
0
01 Oct 2024
Improving curriculum learning for target speaker extraction with
  synthetic speakers
Improving curriculum learning for target speaker extraction with synthetic speakers
Yun Liu
Xuechen Liu
Junichi Yamagishi
18
0
0
01 Oct 2024
Zero-Shot Text-to-Speech from Continuous Text Streams
Zero-Shot Text-to-Speech from Continuous Text Streams
Trung D. Q. Dang
David Aponte
Dung Tran
Tianyi Chen
K. Koishida
AuLLM
VLM
24
3
0
01 Oct 2024
Recent Advances in Speech Language Models: A Survey
Recent Advances in Speech Language Models: A Survey
Wenqian Cui
Dianzhi Yu
Xiaoqi Jiao
Ziqiao Meng
Guangyan Zhang
Qichao Wang
Yiwen Guo
Irwin King
AuLLM
59
14
0
01 Oct 2024
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for
  Neural Codec Language Models
Analyzing and Mitigating Inconsistency in Discrete Audio Tokens for Neural Codec Language Models
Wenrui Liu
Zhifang Guo
Jin Xu
Yuanjun Lv
Yunfei Chu
Zhou Zhao
Junyang Lin
41
1
0
28 Sep 2024
EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based
  Speech Synthesis
EmoPro: A Prompt Selection Strategy for Emotional Expression in LM-based Speech Synthesis
Haoyu Wang
Chunyu Qiang
Tianrui Wang
Cheng Gong
Qiuyu Liu
Yu Jiang
Xiaobao Wang
Chenyang Wang
Chen Zhang
31
1
0
27 Sep 2024
MIO: A Foundation Model on Multimodal Tokens
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
51
11
0
26 Sep 2024
Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
Kun Zhou
You Zhang
Shengkui Zhao
Hao Wang
Zexu Pan
...
Chongjia Ni
Yukun Ma
Trung Hieu Nguyen
J. Yip
Bin Ma
44
5
0
25 Sep 2024
NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple
  Speakers
NanoVoice: Efficient Speaker-Adaptive Text-to-Speech for Multiple Speakers
Nohil Park
Heeseung Kim
Che Hyun Lee
Jooyoung Choi
Jiheum Yeom
Sungroh Yoon
20
2
0
24 Sep 2024
VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient
  Speaker-Adaptive Text-to-Speech via Autoguidance
VoiceGuider: Enhancing Out-of-Domain Performance in Parameter-Efficient Speaker-Adaptive Text-to-Speech via Autoguidance
Jiheum Yeom
Heeseung Kim
Jooyoung Choi
Che Hyun Lee
Nohil Park
Sungroh Yoon
19
1
0
24 Sep 2024
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion
  for Zero-shot Text-to-speech Synthesis
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
Zhiyong Chen
Xinnuo Li
Zhiqi Ai
Shugong Xu
DiffM
28
1
0
24 Sep 2024
Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec
  models
Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models
Haibin Wu
Xuanjun Chen
Yi-Cheng Lin
Kaiwei Chang
Jiawei Du
...
Yi-Chiao Wu
Xu Tan
James Glass
Shinji Watanabe
Hung-yi Lee
24
6
0
21 Sep 2024
Audio Codec Augmentation for Robust Collaborative Watermarking of Speech
  Synthesis
Audio Codec Augmentation for Robust Collaborative Watermarking of Speech Synthesis
Lauri Juvela
Xin Eric Wang
21
2
0
20 Sep 2024
MuCodec: Ultra Low-Bitrate Music Codec
MuCodec: Ultra Low-Bitrate Music Codec
Yaoxun Xu
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Shun Lei
Zhiwei Lin
Zhiyong Wu
20
1
0
20 Sep 2024
NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector
  Quantization
NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization
Zhikang Niu
Sanyuan Chen
Long Zhou
Ziyang Ma
Xie Chen
Shujie Liu
29
2
0
19 Sep 2024
Preference Alignment Improves Language Model-Based TTS
Preference Alignment Improves Language Model-Based TTS
Jinchuan Tian
Chunlei Zhang
Jiatong Shi
Hao Zhang
Jianwei Yu
Shinji Watanabe
Dong Yu
25
7
0
19 Sep 2024
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Sijing Chen
Yuan Feng
Laipeng He
Tianwei He
Wendi He
...
Huimin Zhang
Xiang Zhang
Guangcheng Zhao
Hongbin Zhou
Pengpeng Zou
25
4
0
18 Sep 2024
Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality
  Speech LLM Training and Inference
Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova
Ryan Langman
Paarth Neekhara
Shehzeen Samarah Hussain
Jason Chun Lok Li
Subhankar Ghosh
Ante Jukić
Sang-gil Lee
AuLLM
29
2
0
18 Sep 2024
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
SpoofCeleb: Speech Deepfake Detection and SASV In The Wild
Jee-weon Jung
Yihan Wu
Xin Wang
Ji-Hoon Kim
Soumi Maiti
...
Joon Son Chung
Wangyou Zhang
Seyun Um
Shinnosuke Takamichi
Shinji Watanabe
60
1
0
18 Sep 2024
Leveraging AI-Generated Emotional Self-Voice to Nudge People towards
  their Ideal Selves
Leveraging AI-Generated Emotional Self-Voice to Nudge People towards their Ideal Selves
Cathy Mengying Fang
Phoebe Chua
Samantha Chan
J. Leong
Andria Bao
Pattie Maes
16
1
0
17 Sep 2024
Enhancing Multilingual Speech Generation and Recognition Abilities in
  LLMs with Constructed Code-switched Data
Enhancing Multilingual Speech Generation and Recognition Abilities in LLMs with Constructed Code-switched Data
Jing Xu
Daxin Tan
Jiaqi Wang
Xiao Chen
19
0
0
17 Sep 2024
Adaptive Large Language Models By Layerwise Attention Shortcuts
Adaptive Large Language Models By Layerwise Attention Shortcuts
Prateek Verma
Mert Pilanci
KELM
OffRL
42
0
0
17 Sep 2024
Emo-DPO: Controllable Emotional Speech Synthesis through Direct
  Preference Optimization
Emo-DPO: Controllable Emotional Speech Synthesis through Direct Preference Optimization
Xiaoxue Gao
Chen Zhang
Yiming Chen
Huayun Zhang
Nancy F. Chen
32
6
0
16 Sep 2024
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis
  with Distilled Time-Varying Style Diffusion
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li
Xilin Jiang
Cong Han
N. Mesgarani
DiffM
29
4
0
16 Sep 2024
Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation
Improving Robustness of Diffusion-Based Zero-Shot Speech Synthesis via Stable Formant Generation
C. Han
Seokgi Lee
Gyuhyeon Nam
Gyeongsu Chae
DiffM
49
0
0
14 Sep 2024
Seed-Music: A Unified Framework for High Quality and Controlled Music
  Generation
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Ye Bai
Haonan Chen
Jitong Chen
Zhuo Chen
Yi Deng
...
Hang Zhao
Ziyi Zhao
Dejian Zhong
Shicen Zhou
Pei Zou
DiffM
58
6
0
13 Sep 2024
Affective Computing Has Changed: The Foundation Model Disruption
Affective Computing Has Changed: The Foundation Model Disruption
Björn Schuller
Adria Mallol-Ragolta
Alejandro Pena Almansa
Iosif Tsangko
Mostafa M. Amin
A. Semertzidou
Lukas Christ
Shahin Amiriparian
28
0
0
13 Sep 2024
Text-To-Speech Synthesis In The Wild
Text-To-Speech Synthesis In The Wild
Jee-weon Jung
Wangyou Zhang
Soumi Maiti
Yihan Wu
Xin Wang
...
Hye-jin Shim
Nicholas W. D. Evans
Joon Son Chung
Shinnosuke Takamichi
Shinji Watanabe
32
1
0
13 Sep 2024
Investigating Disentanglement in a Phoneme-level Speech Codec for
  Prosody Modeling
Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling
Sotirios Karapiperis
Nikolaos Ellinas
Alexandra Vioni
Junkwang Oh
Gunu Jho
Inchul Hwang
S. Raptis
21
0
0
13 Sep 2024
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in
  New Paradigm
Muskits-ESPnet: A Comprehensive Toolkit for Singing Voice Synthesis in New Paradigm
Yuning Wu
Jiatong Shi
Yifeng Yu
Yuxun Tang
Tao Qian
Yueqian Lin
Jionghao Han
Xinyi Bai
Shinji Watanabe
Qin Jin
29
3
0
11 Sep 2024
Enhancing Kurdish Text-to-Speech with Native Corpus Training: A
  High-Quality WaveGlow Vocoder Approach
Enhancing Kurdish Text-to-Speech with Native Corpus Training: A High-Quality WaveGlow Vocoder Approach
Abdulhady Abas Abdullah
Sabat Salih Muhamad
Hadi Veisi
15
0
0
10 Sep 2024
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Qingkai Fang
Shoutao Guo
Yan Zhou
Zhengrui Ma
Shaolei Zhang
Yang Feng
AuLLM
25
29
0
10 Sep 2024
Estimating the Completeness of Discrete Speech Units
Estimating the Completeness of Discrete Speech Units
Sung-Lin Yeh
Hao Tang
28
1
0
09 Sep 2024
SongCreator: Lyrics-based Universal Song Generation
SongCreator: Lyrics-based Universal Song Generation
Shun Lei
Yixuan Zhou
Boshi Tang
Max W. Y. Lam
Feng Liu
Hangyu Liu
Jingcheng Wu
Shiyin Kang
Zhiyong Wu
Helen Meng
38
4
0
09 Sep 2024
Investigating Neural Audio Codecs for Speech Language Model-Based Speech
  Generation
Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
Jiaqi Li
Dongmei Wang
Xiaofei Wang
Yao Qian
Long Zhou
...
Junkun Chen
Sheng Zhao
Jinyu Li
Zhizheng Wu
Michael Zeng
AuLLM
19
2
0
06 Sep 2024
LAST: Language Model Aware Speech Tokenization
LAST: Language Model Aware Speech Tokenization
A. Turetzky
Yossi Adi
24
2
0
05 Sep 2024
Previous
123456...8910
Next