BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

12 February 2024

Álvaro Martín-Cortinas

Soledad López Gambino

Papers citing "BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data"

Showing 50 of 68 citing papers

Vox-Evaluator: Enhancing Stability and Fidelity for Zero-shot TTS with A Multi-Level Evaluator

171

23 Oct 2025

Comprehend and Talk: Text to Speech Synthesis via Dual Language Modeling

156

26 Sep 2025

SimulMEGA: MoE Routers are Advanced Policy Makers for Simultaneous Speech Translation

284

01 Sep 2025

MPO: Multidimensional Preference Optimization for Language Model-based Text-to-Speech

117

31 Aug 2025

Mitigating Hallucinations in LM-Based TTS Models via Distribution Alignment Using GFlowNets

219

21 Aug 2025

Long-Context Speech Synthesis with Context-Aware Memory

221

20 Aug 2025

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM

Dariia Puhach

Amir H. Payberah

Éva Székely

187

19 Aug 2025

The State Of TTS: A Case Study with Human Fooling Rates

Praveen Srinivasa Varadhan

157

06 Aug 2025

Dataset of News Articles with Provenance Metadata for Media Relevance Assessment

Tomas Peterka

Matyas Bohacek

235

11 Jun 2025

Audio Generation Through Score-Based Generative Modeling: Design Principles and Implementation

311

10 Jun 2025

CoVoMix2: Advancing Zero-Shot Dialogue Generation with Fully Non-Autoregressive Flow Matching

...

319

01 Jun 2025

Accelerating Flow-Matching-Based Text-to-Speech via Empirically Pruned Step Sampling

454

26 May 2025

The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages

Chris C. Emezue

NaijaVoices Community

Busayo Awobade

A. Owodunni

Handel Emezue

...

Nefertiti Nneoma Emezue

Sewade Ogun

Bunmi Akinremi

David Ifeoluwa Adelani

Chris Pal

394

26 May 2025

VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation

271

26 May 2025

CloneShield: A Framework for Universal Perturbation Against Zero-Shot Voice Cloning

409

25 May 2025

Eta-WavLM: Efficient Speaker Identity Removal in Self-Supervised Speech Representations Using a Simple Linear Equation

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

287

25 May 2025

Discrete Audio Representations for Automated Audio Captioning

303

21 May 2025

Accelerating Autoregressive Speech Synthesis Inference With Speech Speculative Decoding

353

21 May 2025

Universal Semantic Disentangled Privacy-preserving Speech Representation Learning

...

Roberto Barra-Chicote

362

19 May 2025

MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable Speaker Encoder

...

404

12 May 2025

Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis

...

409

14 Apr 2025

USM-VC: Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion

579

11 Apr 2025

Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis

256

10 Apr 2025

F5R-TTS: Improving Flow-Matching based Text-to-Speech with Group Relative Policy Optimization

581

03 Apr 2025

DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis

453

21 Feb 2025

Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2025

...

444

27 Jan 2025

SD-Eval: A Benchmark Dataset for Spoken Dialogue Understanding Beyond Words

Neural Information Processing Systems (NeurIPS), 2024

504

17 Jan 2025

The NPU-HWC System for the ISCSLP 2024 Inspirational and Convincing Audio Generation Challenge

International Symposium on Chinese Spoken Language Processing (ISCSLP), 2024

269

31 Oct 2024

Fast and High-Quality Auto-Regressive Speech Synthesis via Speculative Decoding

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

412

29 Oct 2024

Enhancing Low-Resource ASR through Versatile TTS: Bridging the Data Gap

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

295

22 Oct 2024

Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

322

17 Oct 2024

SF-Speech: Straightened Flow for Zero-Shot Voice Clone

IEEE Transactions on Audio, Speech, and Language Processing (TASLP), 2024

569

16 Oct 2024

Emphasis Rendering for Conversational Text-to-Speech with Multi-modal Multi-scale Context Modeling

351

12 Oct 2024

Graded Suspiciousness of Adversarial Texts to Human

Shakila Mahjabin Tonni

Pedro Faustini

Mark Dras

AAML

235

06 Oct 2024

Zero-Shot Text-to-Speech from Continuous Text Streams

Trung D. Q. Dang

David Aponte

Dung Tran

Tianyi Chen

K. Koishida

AuLLM VLM

194

01 Oct 2024

EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Haozhe Chen

Run Chen

Julia Hirschberg

325

01 Oct 2024

Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

261

26 Sep 2024

Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions

Kun Zhou

You Zhang

Shengkui Zhao

Zexu Pan

Dianwen Ng

336

25 Sep 2024

Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models

Sijing Chen

Laipeng He

...

Xiang Zhang

344

18 Sep 2024

Speaking from Coarse to Fine: Improving Neural Codec Language Model via Multi-Scale Speech Coding and Generation

Haohan Guo

Fenglong Xie

Dongchao Yang

Xixin Wu

Helen Meng

320

18 Sep 2024

Seed-Music: A Unified Framework for High Quality and Controlled Music Generation

Ye Bai

Haonan Chen

Jitong Chen

Zhuo Chen

...

Shicen Zhou

365

13 Sep 2024

Zero-Shot Sing Voice Conversion: built upon clustering-based phoneme representations

Fengrun Zhang

265

12 Sep 2024

FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications

Xu Tang

Kun Xie

Kai-Tuo Xu

427

05 Sep 2024

SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis

Spoken Language Technology Workshop (SLT), 2024

Dongchao Yang

Xixin Wu

Helen Meng

214

02 Sep 2024

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

International Conference on Learning Representations (ICLR), 2024

Yuancheng Wang

Zhizheng Wu

523

181

01 Sep 2024

Text-to-Speech for Unseen Speakers via Low-Complexity Discrete Unit-Based Frame Selection

Ismail Rasim Ulgen

Shreeram Suresh Chandra

Junchen Lu

Berrak Sisman

1.0K

30 Aug 2024

Enabling Beam Search for Language Model-Based Text-to-Speech Synthesis

IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024

267

29 Aug 2024

Language Model Can Listen While Speaking

AAAI Conference on Artificial Intelligence (AAAI), 2024

Yakun Song

Zhuo Chen

400

05 Aug 2024

Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

317

01 Aug 2024

Overview of Speaker Modeling and Its Applications: From the Lens of Deep Speaker Representation Learning

Shuai Wang

Zheng-Shou Chen

Kong Aik Lee

Yan-min Qian

Haizhou Li

377

21 Jul 2024