ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2406.05370
  4. Cited By
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text
  to Speech Synthesizers

VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers

8 June 2024
Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei
    VLM
ArXivPDFHTML

Papers citing "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers"

46 / 46 papers shown
Title
Voice Cloning: Comprehensive Survey
Voice Cloning: Comprehensive Survey
Hussam Azzuni
Abdulmotaleb El Saddik
VLM
32
0
0
01 May 2025
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
AlignDiT: Multimodal Aligned Diffusion Transformer for Synchronized Speech Generation
J. Choi
Ji-Hoon Kim
Kim Sung-Bin
Tae-Hyun Oh
Joon Son Chung
DiffM
48
0
0
29 Apr 2025
Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a $50K Budget
Muyan-TTS: A Trainable Text-to-Speech Model Optimized for Podcast Scenarios with a 50KBudget50K Budget50KBudget
Xin Li
Kaikai Jia
Hao Sun
Jun Dai
Z. L. Jiang
70
0
0
27 Apr 2025
GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture
GOAT-TTS: LLM-based Text-To-Speech Generation Optimized via A Dual-Branch Architecture
Yaodong Song
Hongjie Chen
Jie Lian
Yuxin Zhang
Guangmin Xia
Zehan Li
Genliang Zhao
Jian Kang
Y. Li
Jie Li
AuLLM
30
0
0
15 Apr 2025
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Yifan Yang
S. Liu
J. Li
Yuxuan Hu
Haibin Wu
...
Haiyang Sun
Yanqing Liu
Yan Lu
Kai Yu
Xie Chen
23
0
0
14 Apr 2025
SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation
SpeakEasy: Enhancing Text-to-Speech Interactions for Expressive Content Creation
Stephen Brade
Sam Anderson
Rithesh Kumar
Zeyu Jin
Anh Truong
31
0
0
07 Apr 2025
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun
Ruitong Xiao
Jianye Mo
Bowen Wu
Qun Yu
Baoxun Wang
39
1
0
03 Apr 2025
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
VoiceCraft-Dub: Automated Video Dubbing with Neural Codec Language Models
Kim Sung-Bin
Jeongsoo Choi
Puyuan Peng
Joon Son Chung
Tae-Hyun Oh
David F. Harwath
VGen
45
1
0
03 Apr 2025
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
X. Wang
Mingqi Jiang
Z. Ma
Ziyu Zhang
S. Liu
...
Zhifei Li
Xie Chen
Lei Xie
Y. Guo
Wei Xue
73
10
0
03 Mar 2025
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
Ziyue Jiang
Yi Ren
Ruiqi Li
Shengpeng Ji
Zhenhui Ye
...
Y. Zhang
Rui Liu
Xiang Yin
Zhou Zhao
Zhou Zhao
64
3
0
26 Feb 2025
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yingahao Aaron Li
Rithesh Kumar
Zeyu Jin
DiffM
91
0
0
21 Feb 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Z. Liu
Shuangrui Ding
Zhixiong Zhang
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Y. Cao
D. Lin
Jiaqi Wang
74
0
0
18 Feb 2025
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching
Hui Wang
Shujie Liu
Lingwei Meng
J. Li
Yifan Yang
...
Yanqing Liu
Haoqin Sun
Jiaming Zhou
Yan Lu
Yong Qin
48
0
0
16 Feb 2025
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Qianniu Chen
Xiaoyang Hao
B. Li
Y. Liu
Li Lu
34
0
0
15 Jan 2025
AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
AccentBox: Towards High-Fidelity Zero-Shot Accent Generation
Jinzuomu Zhong
Korin Richmond
Zhiba Su
Siqi Sun
53
4
0
10 Jan 2025
Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
Continuous Speech Tokens Makes LLMs Robust Multi-Modality Learners
Ze Yuan
Yanqing Liu
Shujie Liu
Sheng Zhao
AuLLM
74
1
0
06 Dec 2024
VQalAttent: a Transparent Speech Generation Pipeline based on
  Transformer-learned VQ-VAE Latent Space
VQalAttent: a Transparent Speech Generation Pipeline based on Transformer-learned VQ-VAE Latent Space
Armani Rodriguez
S. Kokalj-Filipovic
67
0
0
22 Nov 2024
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model
  with Frozen LLM
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Xiong Wang
Yangze Li
Chaoyou Fu
Yunhang Shen
Lei Xie
Ke Li
Xing Sun
Long Ma
AuLLM
MLLM
29
26
0
01 Nov 2024
Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
Bohan Li
Hankun Wang
Situo Zhang
Yiwei Guo
Kai Yu
31
5
0
29 Oct 2024
Asynchronous Tool Usage for Real-Time Agents
Asynchronous Tool Usage for Real-Time Agents
Antonio A. Ginart
Naveen Kodali
J. Lee
Caiming Xiong
Silvio Savarese
John Emmons
LLMAG
SyDa
28
0
0
28 Oct 2024
Enhancing TTS Stability in Hebrew using Discrete Semantic Units
Enhancing TTS Stability in Hebrew using Discrete Semantic Units
Ella Zeldes
Or Tal
Yossi Adi
27
0
0
28 Oct 2024
Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
Guanrou Yang
Fan Yu
Z. Ma
Zhihao Du
Zhifu Gao
Shiliang Zhang
Xie Chen
27
1
0
22 Oct 2024
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Alan Dao
Dinh Bach Vu
Huy Hoang Ha
AuLLM
VLM
59
3
0
20 Oct 2024
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction
  and Speculative Decoding
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
Tan Dat Nguyen
Ji-Hoon Kim
Jeongsoo Choi
Shukjae Choi
Jinseok Park
Younglo Lee
Joon Son Chung
26
0
0
17 Oct 2024
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Xuyuan Li
Zengqiang Shang
Hua Hua
Peiyang Shi
Chen Yang
Li Wang
Pengyuan Zhang
40
2
0
16 Oct 2024
JOOCI: a Framework for Learning Comprehensive Speech Representations
JOOCI: a Framework for Learning Comprehensive Speech Representations
Hemant Yadav
R. Shah
Sunayana Sitaram
21
0
0
14 Oct 2024
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow
  Matching
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen
25
51
0
09 Oct 2024
Can DeepFake Speech be Reliably Detected?
Can DeepFake Speech be Reliably Detected?
Hongbin Liu
Youzheng Chen
Arun Narayanan
Athula Balachandran
Pedro J. Moreno
Lun Wang
AAML
24
1
0
09 Oct 2024
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice
  Interaction Abilities
IntrinsicVoice: Empowering LLMs with Intrinsic Real-time Voice Interaction Abilities
Xin Zhang
Xiang Lyu
Zhihao Du
Qian Chen
Dong Zhang
...
Yuxuan Wang
Bin Zhang
Heng Lu
Yaqian Zhou
Xipeng Qiu
AuLLM
33
5
0
09 Oct 2024
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long
  Zero-Shot Text-to-Speech Synthesis
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Yuto Nishimura
Takumi Hirose
Masanari Ohi
Hideki Nakayama
Nakamasa Inoue
VLM
29
1
0
06 Oct 2024
Zero-Shot Text-to-Speech from Continuous Text Streams
Zero-Shot Text-to-Speech from Continuous Text Streams
Trung D. Q. Dang
David Aponte
Dung Tran
Tianyi Chen
K. Koishida
AuLLM
VLM
29
3
0
01 Oct 2024
Description-based Controllable Text-to-Speech with Cross-Lingual Voice
  Control
Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
Ryuichi Yamamoto
Yuma Shirahata
Masaya Kawamura
Kentaro Tachibana
DiffM
32
2
0
26 Sep 2024
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis
  with Distilled Time-Varying Style Diffusion
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li
Xilin Jiang
Cong Han
N. Mesgarani
DiffM
29
4
0
16 Sep 2024
Seed-Music: A Unified Framework for High Quality and Controlled Music
  Generation
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Ye Bai
Haonan Chen
Jitong Chen
Zhuo Chen
Yi Deng
...
Hang Zhao
Ziyi Zhao
Dejian Zhong
Shicen Zhou
Pei Zou
DiffM
58
6
0
13 Sep 2024
Investigating Neural Audio Codecs for Speech Language Model-Based Speech
  Generation
Investigating Neural Audio Codecs for Speech Language Model-Based Speech Generation
Jiaqi Li
Dongmei Wang
Xiaofei Wang
Yao Qian
Long Zhou
...
Junkun Chen
Sheng Zhao
Jinyu Li
Zhizheng Wu
Michael Zeng
AuLLM
22
2
0
06 Sep 2024
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
SelectTTS: Synthesizing Anyone's Voice via Discrete Unit-Based Frame Selection
Ismail Rasim Ulgen
Shreeram Suresh Chandra
Junchen Lu
Berrak Sisman
80
0
0
30 Aug 2024
Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis
Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis
Zehai Tu
Guangyan Zhang
Yiting Lu
Adaeze Adigwe
Simon King
Yiwen Guo
32
0
0
29 Aug 2024
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Shengpeng Ji
Ziyue Jiang
Xize Cheng
Yifu Chen
Minghui Fang
...
Rongjie Huang
Yidi Jiang
Qian Chen
Zhou Zhao
Zhou Zhao
VLM
52
32
0
29 Aug 2024
Does Current Deepfake Audio Detection Model Effectively Detect ALM-based
  Deepfake Audio?
Does Current Deepfake Audio Detection Model Effectively Detect ALM-based Deepfake Audio?
Yuankun Xie
Chenxu Xiong
Xiaopeng Wang
Zhiyong Wang
Yi Lu
...
Yukun Liu
Zhengqi Wen
Jianhua Tao
Guanjun Li
Long Ye
AuLLM
26
1
0
20 Aug 2024
Temporal Variability and Multi-Viewed Self-Supervised Representations to
  Tackle the ASVspoof5 Deepfake Challenge
Temporal Variability and Multi-Viewed Self-Supervised Representations to Tackle the ASVspoof5 Deepfake Challenge
Yuankun Xie
Xiaopeng Wang
Zhiyong Wang
Ruibo Fu
Zhengqi Wen
Haonan Cheng
Long Ye
35
1
0
13 Aug 2024
TTSDS -- Text-to-Speech Distribution Score
TTSDS -- Text-to-Speech Distribution Score
Christoph Minixhofer
Ondˇrej Klejch
Peter Bell
26
0
0
17 Jul 2024
Autoregressive Speech Synthesis without Vector Quantization
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng
Long Zhou
Shujie Liu
Sanyuan Chen
Bing Han
...
Jinyu Li
Sheng Zhao
Xixin Wu
Helen Meng
Furu Wei
46
30
0
11 Jul 2024
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS
Sefik Emre Eskimez
Xiaofei Wang
Manthan Thakker
Canrun Li
Chung-Hsien Tsai
...
Min Tang
Xu Tan
Yanqing Liu
Sheng Zhao
Naoyuki Kanda
VLM
35
46
0
26 Jun 2024
HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio
  Codec
HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec
Dongchao Yang
Songxiang Liu
Rongjie Huang
Jinchuan Tian
Chao Weng
Yuexian Zou
138
118
0
04 May 2023
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang
Sanyuan Chen
Yu-Huan Wu
Zi-Hua Zhang
Long Zhou
...
Huaming Wang
Jinyu Li
Lei He
Sheng Zhao
Furu Wei
43
637
0
05 Jan 2023
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice
  Conversion for everyone
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
Edresson Casanova
Julian Weber
C. Shulby
Arnaldo Cândido Júnior
Eren Golge
M. Ponti
171
377
0
04 Dec 2021
1