StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

13 June 2023
Yinghao Aaron Li
Cong Han
Vinay S. Raghavan
Gavin Mischler
N. Mesgarani
    VLM
    DiffM

Papers citing "StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models"

50 / 69 papers shown
Title
Voice Cloning: Comprehensive Survey
Hussam Azzuni
Abdulmotaleb El Saddik
VLM
32
0
0
01 May 2025
ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Y. Zhang
Wenxiang Guo
Changhao Pan
Z. Zhu
Tao Jin
Zhou Zhao
VGen
47
0
0
29 Apr 2025
Using Phonemes in cascaded S2S translation pipeline
Rene Pilz
Johannes Schneider
34
0
0
22 Apr 2025
Mitigating Timbre Leakage with Universal Semantic Mapping Residual Block for Voice Conversion
Na Li
Chuke Wang
Yu Gu
Zhifeng Li
54
0
0
11 Apr 2025
Prosody-Enhanced Acoustic Pre-training and Acoustic-Disentangled Prosody Adapting for Movie Dubbing
Zhedong Zhang
Liang-Sheng Li
C. Yan
Chunshan Liu
A. Hengel
Yuankai Qi
83
2
0
15 Mar 2025
Automatic Teaching Platform on Vision Language Retrieval Augmented Generation
Ruslan Gokhman
Jialu Li
Youshan Zhang
VLM
41
0
0
07 Mar 2025
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis
Ziyue Jiang
Yi Ren
Ruiqi Li
Shengpeng Ji
Zhenhui Ye
...
Y. Zhang
Rui Liu
Xiang Yin
Zhou Zhao
64
3
0
26 Feb 2025
AAD-LLM: Neural Attention-Driven Auditory Scene Understanding
Xilin Jiang
Sukru Samet Dindar
Vishal B. Choudhari
Stephan Bickel
A. Mehta
Guy M McKhann
A. Flinker
D. Friedman
N. Mesgarani
32
2
0
24 Feb 2025
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis
Yinghao Aaron Li
Rithesh Kumar
Zeyu Jin
DiffM
91
0
0
21 Feb 2025
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance
Shehzeen Samarah Hussain
Paarth Neekhara
Xuesong Yang
Edresson Casanova
Subhankar Ghosh
Mikyas T. Desta
Roy Fejgin
Rafael Valle
Jason Chun Lok Li
59
2
0
07 Feb 2025
Towards Lightweight and Stable Zero-shot TTS with Self-distilled Representation Disentanglement
Qianniu Chen
Xiaoyang Hao
B. Li
Y. Liu
Li Lu
34
0
0
15 Jan 2025
ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
Xiangheng He
Junjie Chen
Zixing Zhang
Björn W. Schuller
78
0
0
16 Dec 2024
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Haohe Liu
Gaël Le Lan
Xinhao Mei
Zhaoheng Ni
Anurag Kumar
Varun K. Nagaraja
Wenwu Wang
Mark D. Plumbley
Yangyang Shi
Vikas Chandra
VGen
61
1
0
03 Dec 2024
Deepfake Media Generation and Detection in the Generative AI Era: A Survey and Outlook
Florinel-Alin Croitoru
Andrei Iulian Hiji
Vlad Hondru
Nicolae-Cătălin Ristea
Paul Irofti
Marius Popescu
Cristian Rusu
Radu Tudor Ionescu
F. Khan
Mubarak Shah
79
2
0
29 Nov 2024
High-precision medical speech recognition through synthetic data and semantic correction: UNITED-MEDASR
Sourav Banerjee
Ayushi Agarwal
Promila Ghosh
76
2
0
24 Nov 2024
I Can Hear You: Selective Robust Training for Deepfake Audio Detection
Zirui Zhang
Wei Hao
Aroon Sankoh
William Lin
Emanuel Mendiola-Ortiz
Junfeng Yang
Chengzhi Mao
AAML
26
2
0
31 Oct 2024
Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis
Théodor Lemerle
Harrison Vanderbyl
Vaibhav Srivastav
Nicolas Obin
Axel Roebel
31
1
0
30 Oct 2024
ELAICHI: Enhancing Low-resource TTS by Addressing Infrequent and Low-frequency Character Bigrams
Srija Anand
Praveen Srinivasa Varadhan
Mehak Singal
Mitesh M. Khapra
20
0
0
23 Oct 2024
Adversarial Training: A Survey
Mengnan Zhao
Lihe Zhang
Jingwen Ye
Huchuan Lu
Baocai Yin
Xinchao Wang
AAML
21
0
0
19 Oct 2024
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Yushen Chen
Zhikang Niu
Ziyang Ma
Keqi Deng
Chunhui Wang
Jian Zhao
Kai Yu
Xie Chen
25
51
0
09 Oct 2024
Word-wise intonation model for cross-language TTS systems
Tomilov A. A.
Gromova A. Y.
Svischev A. N.
24
0
0
30 Sep 2024
StyleFusion TTS: Multimodal Style-control and Enhanced Feature Fusion for Zero-shot Text-to-speech Synthesis
Zhiyong Chen
Xinnuo Li
Zhiqi Ai
Shugong Xu
DiffM
34
1
0
24 Sep 2024
Low Frame-rate Speech Codec: a Codec Designed for Fast High-quality Speech LLM Training and Inference
Edresson Casanova
Ryan Langman
Paarth Neekhara
Shehzeen Samarah Hussain
Jason Chun Lok Li
Subhankar Ghosh
Ante Jukić
Sang-gil Lee
AuLLM
29
2
0
18 Sep 2024
The Art of Storytelling: Multi-Agent Generative AI for Dynamic Multimodal Narratives
Samee Arif
Taimoor Arif
Muhammad Saad Haroon
Aamina Jamal Khan
Agha Ali Raza
Awais Athar
29
0
0
17 Sep 2024
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion
Yinghao Aaron Li
Xilin Jiang
Cong Han
N. Mesgarani
DiffM
29
4
0
16 Sep 2024
E1 TTS: Simple and Fast Non-Autoregressive TTS
Zhijun Liu
Shuai Wang
Pengcheng Zhu
Mengxiao Bi
Haizhou Li
VLM
DiffM
38
3
0
14 Sep 2024
DFADD: The Diffusion and Flow-Matching Based Audio Deepfake Dataset
Jiawei Du
I-Ming Lin
I-Hsiang Chiu
Xuanjun Chen
Haibin Wu
Wenze Ren
Yu Tsao
Hung-yi Lee
Jyh-Shing Roger Jang
DiffM
35
2
0
13 Sep 2024
Just ASR + LLM? A Study on Speech Large Language Models' Ability to Identify and Understand Speaker in Spoken Dialogue
Junkai Wu
Xulin Fan
Bo-Ru Lu
Xilin Jiang
N. Mesgarani
M. Hasegawa-Johnson
Mari Ostendorf
AuLLM
ELM
56
2
0
07 Sep 2024
SSDM: Scalable Speech Dysfluency Modeling
Jiachen Lian
Xuanru Zhou
Z. Ezzes
Jet M J Vonk
Brittany Morin
D. Baquirin
Zachary Miller
M. G. Tempini
Gopala Anumanchipalli
AuLLM
30
1
0
29 Aug 2024
Improvement Speaker Similarity for Zero-Shot Any-to-Any Voice Conversion of Whispered and Regular Speech
Anastasia Avdeeva
Aleksei Gusev
22
0
0
21 Aug 2024
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform Generation
Sang-Hoon Lee
Ha-Yeong Choi
Seong-Whan Lee
OOD
DiffM
AI4TS
43
5
0
14 Aug 2024
Style-Talker: Finetuning Audio Language Model and Style-Based Text-to-Speech Model for Fast Spoken Dialogue Generation
Yinghao Aaron Li
Xilin Jiang
Jordan Darefsky
Ge Zhu
N. Mesgarani
31
2
0
13 Aug 2024
Central Kurdish Text-to-Speech Synthesis with Novel End-to-End Transformer Training
Hawraz A. Ahmad
Tarik A. Rashid
25
0
0
06 Aug 2024
Zero-Shot vs. Few-Shot Multi-Speaker TTS Using Pre-trained Czech SpeechT5 Model
Jan Lehecka
Z. Hanzlícek
J. Matousek
Daniel Tihelka
26
0
0
24 Jul 2024
Enhancing Out-of-Vocabulary Performance of Indian TTS Systems for Practical Applications through Low-Effort Data Strategies
Srija Anand
Praveena Varadhan
Ashwin Sankar
Giri Raju
Mitesh M. Khapra
37
1
0
18 Jul 2024
Preset-Voice Matching for Privacy Regulated Speech-to-Speech Translation Systems
Daniel Platnick
Bishoy Abdelnour
Eamon Earl
Rahul Kumar
Zahra Rezaei
Thomas Tsangaris
Faraj Lagum
23
0
0
18 Jul 2024
TTSDS -- Text-to-Speech Distribution Score
Christoph Minixhofer
Ondřej Klejch
Peter Bell
26
0
0
17 Jul 2024
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng
Long Zhou
Shujie Liu
Sanyuan Chen
Bing Han
...
Jinyu Li
Sheng Zhao
Xixin Wu
Helen Meng
Furu Wei
46
30
0
11 Jul 2024
FLY-TTS: Fast, Lightweight and High-Quality End-to-End Text-to-Speech Synthesis
Yinlin Guo
Yening Lv
Jinqiao Dou
Yan Zhang
Yuehai Wang
18
0
0
30 Jun 2024
Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Zehua Kcriss Li
Meiying Melissa Chen
Yi Zhong
Pinxin Liu
Zhiyao Duan
34
0
0
15 Jun 2024
Diffusion Synthesizer for Efficient Multilingual Speech to Speech Translation
Nameer Hirschkind
Xiao Yu
Mahesh Kumar Nandwana
Joseph Liu
Eloi DuBois
...
Colin Sinclair
Kyle Spence
Charles Shang
Zoë Abrams
Morgan McGuire
27
0
0
14 Jun 2024
VECL-TTS: Voice identity and Emotional style controllable Cross-Lingual Text-to-Speech
Ashishkumar Gudmalwar
Nirmesh Shah
Sai Akarsh
Pankaj Wasnik
R. Shah
19
1
0
12 Jun 2024
Prompting Large Language Models with Audio for General-Purpose Speech Summarization
Wonjune Kang
Deb Roy
LRM
21
7
0
10 Jun 2024
Autoregressive Diffusion Transformer for Text-to-Speech Synthesis
Zhijun Liu
Shuai Wang
Sho Inoue
Qibing Bai
Haizhou Li
DiffM
34
15
0
08 Jun 2024
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei
VLM
39
64
0
08 Jun 2024
RU-AI: A Large Multimodal Dataset for Machine Generated Content Detection
Liting Huang
Zhihao Zhang
Yiran Zhang
Xiyue Zhou
Shoujin Wang
NoLa
38
2
0
07 Jun 2024
XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model
Edresson Casanova
Kelly Davis
Eren Golge
Görkem Göknar
Iulian Gulea
...
Aya Aljafari
Joshua Meyer
Reuben Morais
Samuel Olayemi
Julian Weber
VLM
32
66
0
07 Jun 2024
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach
Wenbin Wang
Yang Song
Sanjay Jha
32
10
0
28 Apr 2024
CLAD: Robust Audio Deepfake Detection Against Manipulation Attacks with Contrastive Learning
Hao Wu
Jing Chen
Ruiying Du
Cong Wu
Kun He
Xingcan Shang
Hao Ren
Guowen Xu
AAML
37
7
0
24 Apr 2024
FlashSpeech: Efficient Zero-Shot Speech Synthesis
Zhen Ye
Zeqian Ju
Haohe Liu
Xu Tan
Jianyi Chen
...
Weizhen Bian
Shulin He
Qi-fei Liu
Yi-Ting Guo
Wei Xue
38
16
0
23 Apr 2024