BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

12 February 2024
Mateusz Lajszczak
Guillermo Cámbara
Yang Li
Fatih Beyhan
Arent van Korlaar
Fan Yang
Arnaud Joly
Álvaro Martín-Cortinas
Ammar Abbas
Adam Michalski
Alexis Moinet
S. Karlapati
Ewa Muszyńska
Haohan Guo
Bartosz Putrycz
Soledad López Gambino
Kayeon Yoo
Elena Sokolova
Thomas Drugman
    LM&MA
arXiv: 2402.08093 (abs, PDF, HTML) | HuggingFace (62 upvotes)

Papers citing "BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data"

50 / 68 papers shown
Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator
H. Wang
Na Li
Chuke Wang
Shu Wu
Zhifeng Li
Dong Yu
DiffM
171
0
0
23 Oct 2025
Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling
Junjie Cao
Yichen Han
Ruonan Zhang
Xiaoyang Hao
Hongxiang Li
Shuaijiang Zhao
Yue Liu
Xiao-Ping Zhang
156
0
0
26 Sep 2025
SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation
Chenyang Le
Bing Han
Jinshun Li
Songyong Chen
Y. Qian
MoE
284
2
0
01 Sep 2025
MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech
Kangxiang Xia
Xinfa Zhu
Jixun Yao
Lei Xie
117
1
0
31 Aug 2025
Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets
Chenlin Liu
Minghui Fang
Patrick Zhang
Wei Zhou
Jie Gao
Jiqing Han
219
1
0
21 Aug 2025
Long-Context Speech Synthesis with Context-Aware Memory
Zhipeng Li
Xiaofen Xing
Jingyuan Xing
Hangrui Hu
Heng Lu
Xiangmin Xu
RALM
221
1
0
20 Aug 2025
Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
Dariia Puhach
Amir H. Payberah
Éva Székely
187
2
0
19 Aug 2025
The State Of TTS: A Case Study with Human Fooling Rates
Praveen Srinivasa Varadhan
Sherry Thomas
Sai Teja M. S.
Suvrat Bhooshan
Mitesh M. Khapra
157
1
0
06 Aug 2025
Dataset of News Articles with Provenance Metadata for Media Relevance Assessment
Tomas Peterka
Matyas Bohacek
235
0
0
11 Jun 2025
Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation
Ge Zhu
Yutong Wen
Zhiyao Duan
DiffM
MedIm
311
3
0
10 Jun 2025
CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching
Leying Zhang
Y. Qian
Xiaofei Wang
Manthan Thakker
Dongmei Wang
...
Haibin Wu
Yuxuan Hu
Jinyu Li
Yanmin Qian
Sheng Zhao
319
8
0
01 Jun 2025
Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling
Qixi Zheng
Emmanouil Benetos
Zhikang Niu
Ziyang Ma
Xiaofei Wang
Kai Yu
Xie Chen
454
4
0
26 May 2025
The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
Chris C. Emezue
NaijaVoices Community
Busayo Awobade
A. Owodunni
Handel Emezue
...
Nefertiti Nneoma Emezue
Sewade Ogun
Bunmi Akinremi
David Ifeoluwa Adelani
Chris Pal
394
5
0
26 May 2025
VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation
Puyuan Peng
Shang-Wen Li
Abdelrahman Mohamed
David Harwath
271
0
0
26 May 2025
CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning
Renyuan Li
Zhibo Liang
Haichuan Zhang
Tianyu Shi
Zhiyuan Cheng
Jia Shi
Carl Yang
Mingjie Tang
AAML
409
2
0
25 May 2025
Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Giuseppe Ruggiero
Matteo Testa
Jurgen Van de Walle
Luigi Di Caro
287
2
0
25 May 2025
Discrete Audio Representations for Automated Audio Captioning
Jingguang Tian
Haoqin Sun
Xinhui Hu
Xinkang Xu
303
1
0
21 May 2025
Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding
Zijian Lin
Yang Zhang
Yougen Yuan
Yuming Yan
Jinjiang Liu
Zhiyong Wu
Pengfei Hu
Qun Yu
353
5
0
21 May 2025
Universal Semantic Disentangled Privacy-preserving Speech Representation Learning
Biel Tura Vecino
Subhadeep Maji
Aravind Varier
Antonio Bonafonte
Ivan Valles
...
Roberto Barra-Chicote
Ariya Rastrow
C. Papayiannis
Volker Leutnant
Trevor Wood
362
2
0
19 May 2025
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder
Bowen Zhang
Congchao Guo
Geng Yang
Hang Yu
Haozhe Zhang
...
Yichen Xiao
Yiying Zhou
Yujiao Shi
Yuan Lu
Yucen He
404
35
0
12 May 2025
Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis
Yifan Yang
Shixuan Liu
Jiajian Li
Yuxuan Hu
Haibin Wu
...
Haiyang Sun
Yanqing Liu
Yan Lu
Kai Yu
Xie Chen
409
8
0
14 Apr 2025
USM-VC: Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion
Na Li
Chuke Wang
Yu Gu
Zhifeng Li
579
0
0
11 Apr 2025
Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis
Yizhong Geng
Jizhuo Xu
Zeyu Liang
Jinghan Yang
Xiaoyi Shi
Xiaoyu Shen
256
0
0
10 Apr 2025
F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization
Xiaohui Sun
Ruitong Xiao
Jianye Mo
Bowen Wu
Qun Yu
Baoxun Wang
581
17
0
03 Apr 2025
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yinghao Aaron Li
Rithesh Kumar
Zeyu Jin
DiffM
453
0
0
21 Feb 2025
Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2025
Haorui He
Zengqiang Shang
Chaoren Wang
Xuyuan Li
Yicheng Gu
...
Peiyang Shi
Longji Xu
Kai Chen
Pengyuan Zhang
Zhikai Wu
AuLLM
444
22
0
27 Jan 2025
SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words
Neural Information Processing Systems (NeurIPS), 2024
Junyi Ao
Yuancheng Wang
Xiaohai Tian
Dekun Chen
Jing Zhang
Lu Lu
Longji Xu
Haizhou Li
Zhikai Wu
AuLLM
504
60
0
17 Jan 2025
The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge
International Symposium on Chinese Spoken Language Processing (ISCSLP), 2024
Dake Guo
Jixun Yao
Xinfa Zhu
Kangxiang Xia
Zhao Guo
Ziyu Zhang
Yun Wang
Jie Liu
Lei Xie
269
3
0
31 Oct 2024
Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Bohan Li
Hankun Wang
Situo Zhang
Yiwei Guo
Kai Yu
412
18
0
29 Oct 2024
Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Guanrou Yang
Fan Yu
Tianhao Shen
Zhihao Du
Zhifu Gao
Shiliang Zhang
Xie Chen
295
14
0
22 Oct 2024
Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Tan Dat Nguyen
Ji-Hoon Kim
Jeongsoo Choi
Shukjae Choi
Jinseok Park
Younglo Lee
Joon Son Chung
322
9
0
17 Oct 2024
SF-Speech: Straightened Flow for Zero-Shot Voice Clone
IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024
Xuyuan Li
Zengqiang Shang
Hua Hua
Peiyang Shi
Chen Yang
Li Wang
Pengyuan Zhang
569
5
0
16 Oct 2024
Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling
Rui Liu
Zhenqi Jia
Jie Yang
Yifan Hu
Hong Li
351
5
0
12 Oct 2024
Graded Suspiciousness of Adversarial Texts to Human
Shakila Mahjabin Tonni
Pedro Faustini
Mark Dras
AAML
235
1
0
06 Oct 2024
Zero-Shot Text-to-Speech from Continuous Text Streams
Trung D. Q. Dang
David Aponte
Dung Tran
Tianyi Chen
K. Koishida
AuLLM
VLM
194
13
0
01 Oct 2024
EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Haozhe Chen
Run Chen
Julia Hirschberg
325
13
0
01 Oct 2024
Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Ryuichi Yamamoto
Yuma Shirahata
Masaya Kawamura
Kentaro Tachibana
DiffM
261
4
0
26 Sep 2024
Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions
Kun Zhou
You Zhang
Shengkui Zhao
Zexu Pan
Dianwen Ng
336
10
0
25 Sep 2024
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Sijing Chen
Qi Liu
Laipeng He
Tianwei He
Wendi He
...
Huimin Zhang
Xiang Zhang
Guangcheng Zhao
Hongbin Zhou
Pengpeng Zou
344
13
0
18 Sep 2024
Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation
Haohan Guo
Fenglong Xie
Dongchao Yang
Xixin Wu
Helen Meng
320
8
0
18 Sep 2024
Seed-Music: A Unified Framework for High Quality and Controlled Music Generation
Ye Bai
Haonan Chen
Jitong Chen
Zhuo Chen
Yi Deng
...
Hang Zhao
Ziyi Zhao
Dejian Zhong
Shicen Zhou
Pei Zou
DiffM
365
22
0
13 Sep 2024
Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations
Wangjin Zhou
Fengrun Zhang
Yiming Liu
Wenhao Guan
Yi Zhao
He Qu
265
5
0
12 Sep 2024
FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications
Hao-Han Guo
Kun Liu
Fei-Yu Shen
Yi-Chen Wu
Xu Tang
Kun Xie
Kai-Tuo Xu
427
89
0
05 Sep 2024
SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis
Spoken Language Technology Workshop (SLT), 2024
Haohan Guo
Fenglong Xie
Kun Xie
Dongchao Yang
Dake Guo
Xixin Wu
Helen Meng
214
12
0
02 Sep 2024
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
International Conference on Learning Representations (ICLR), 2024
Yuancheng Wang
Haoyue Zhan
Liwei Liu
Ruihong Zeng
Haotian Guo
Jiachen Zheng
Qiang Zhang
Shunsi Zhang
Zhizheng Wu
523
181
0
01 Sep 2024
Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection
Ismail Rasim Ulgen
Shreeram Suresh Chandra
Junchen Lu
Berrak Sisman
1.0K
1
0
30 Aug 2024
Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Zehai Tu
Guangyan Zhang
Yiting Lu
Adaeze Adigwe
Simon King
Yiwen Guo
267
1
0
29 Aug 2024
Language Model Can Listen While Speaking
AAAI Conference on Artificial Intelligence (AAAI), 2024
Ziyang Ma
Yakun Song
Chenpeng Du
Jian Cong
Zhuo Chen
Yuping Wang
Longji Xu
Xie Chen
AuLLM
400
53
0
05 Aug 2024
Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
Xinhan Di
Jiahao Lu
Yunming Liang
Junjie Zheng
Yihua Wang
Chaofan Ding
ALM
317
3
0
01 Aug 2024
Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning
Shuai Wang
Zheng-Shou Chen
Kong Aik Lee
Yan-min Qian
Haizhou Li
377
28
0
21 Jul 2024