Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior

6 February 2020

Guangzhi Sun

Andrew Rosenberg

Bhuvana Ramabhadran

Yonghui Wu

DiffM


Papers citing "Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior" (50 of 60 shown)

An Exhaustive Evaluation of TTS- and VC-based Data Augmentation for ASR Sewade Ogun Vincent Colotte Emmanuel Vincent 59 0 0 11 Mar 2025
α-TCVAE: On the relationship between Disentanglement and Diversity Cristian Meo Louis Mahon Anirudh Goyal Justin Dauwels DRL 61 8 0 01 Nov 2024
Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling Sotirios Karapiperis Nikolaos Ellinas Alexandra Vioni Junkwang Oh Gunu Jho Inchul Hwang S. Raptis 33 0 0 13 Sep 2024
VQUNet: Vector Quantization U-Net for Defending Adversarial Atacks by Regularizing Unwanted Noise Zhixun He Mukesh Singhal 28 1 0 05 Jun 2024
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech Jaehyeon Kim Keon Lee Seungjun Chung Jaewoong Cho 70 39 0 03 Apr 2024
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis Wenhao Guan Yishuang Li Tao Li Hukai Huang Feng Wang Jiayan Lin Lingyan Huang Lin Li Q. Hong 23 8 0 17 Dec 2023
Improving End-to-End Speech Processing by Efficient Text Data Utilization with Latent Synthesis Jianqiao Lu Wenyong Huang Nianzu Zheng Xingshan Zeng Y. Yeung Xiao Chen SyDa 19 1 0 09 Oct 2023
Cross-Utterance Conditioned VAE for Speech Generation Y. Li Cheng Yu Guangzhi Sun Weiqin Zu Zheng Tian ... Wei Pan Chao Zhang Jun Wang Yang Yang Fanglei Sun 11 2 0 08 Sep 2023
Variational latent discrete representation for time series modelling Max H. Cohen M. Charbit Sylvain Le Corff 25 1 0 27 Jun 2023
Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge Wenhao Guan Tao Li Yishuang Li Hukai Huang Q. Hong Lin Li DiffM 27 6 0 07 Jun 2023
PITS: Variational Pitch Inference without Fundamental Frequency for End-to-End Pitch-controllable TTS Junhyeok Lee Wonbin Jung Hyunjae Cho Jaeyeon Kim Jaehwan Kim 17 3 0 24 Feb 2023
MAC: A unified framework boosting low resource automatic speech recognition Zeping Min Qian Ge Zhong Li E. Weinan 13 1 0 05 Feb 2023
Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker Navjot Kaur Paige Tuttosi 19 2 0 29 Jan 2023
Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis Chunyu Qiang Peng Yang Hao Che Xiaorui Wang Zhongyuan Wang BDL 21 6 0 13 Dec 2022
Controllable speech synthesis by learning discrete phoneme-level prosodic representations Nikolaos Ellinas Myrsini Christidou Alexandra Vioni June Sig Sung Aimilios Chalamandaris Pirros Tsiakoulis P. Mastorocostas 17 7 0 29 Nov 2022
Prosody-controllable spontaneous TTS with neural HMMs Harm Lameris Shivam Mehta G. Henter Joakim Gustafson Éva Székely 33 15 0 24 Nov 2022
Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis Konstantinos Klapsas Karolos Nikitaras Nikolaos Ellinas June Sig Sung Inchul Hwang S. Raptis Aimilios Chalamandaris Pirros Tsiakoulis 19 0 0 02 Nov 2022
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis Karolos Nikitaras Konstantinos Klapsas Nikolaos Ellinas Georgia Maniati June Sig Sung Inchul Hwang S. Raptis Aimilios Chalamandaris Pirros Tsiakoulis 14 0 0 01 Nov 2022
Speech Synthesis with Mixed Emotions Kun Zhou Berrak Sisman R. Rana B.W.Schuller Haizhou Li 14 43 0 11 Aug 2022
Self-supervised Context-aware Style Representation for Expressive Speech Synthesis Yihan Wu Xi Wang S. Zhang Lei He Ruihua Song J. Nie 34 15 0 25 Jun 2022
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Rongjie Huang Yi Ren Jinglin Liu Chenye Cui Zhou Zhao OODD VLM 115 34 0 15 May 2022
Read the Room: Adapting a Robot's Voice to Ambient and Social Contexts Paige Tuttosi Emma Hughson Akihiro Matsufuji Angelica Lim 20 4 0 10 May 2022
Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech Y. Li Cheng Yu Guangzhi Sun Hua Jiang Fanglei Sun Weiqin Zu Ying Wen Yang Yang Jun Wang 17 7 0 09 May 2022
Layer-wise Fast Adaptation for End-to-End Multi-Accent Speech Recognition Xun Gong Y. Qian Houjun Huang Yanmin Qian 26 44 0 21 Apr 2022
Hierarchical and Multi-Scale Variational Autoencoder for Diverse and Natural Non-Autoregressive Text-to-Speech Jaesung Bae Jinhyeok Yang Taejun Bak Young-Sun Joo DiffM 19 6 0 08 Apr 2022
Unsupervised word-level prosody tagging for controllable speech synthesis Yiwei Guo Chenpeng Du Kai Yu 13 15 0 15 Feb 2022
Disentangling Style and Speaker Attributes for TTS Style Transfer Xiaochun An Frank Soong Lei Xie 54 18 0 24 Jan 2022
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis Alexandra Vioni Myrsini Christidou Nikolaos Ellinas G. Vamvoukakis Panos Kakoulidis Taehoon Kim June Sig Sung Hyoungmin Park Aimilios Chalamandaris Pirros Tsiakoulis 11 11 0 19 Nov 2021
Word-Level Style Control for Expressive, Non-attentive Speech Synthesis Konstantinos Klapsas Nikolaos Ellinas June Sig Sung Hyoungmin Park S. Raptis 22 9 0 19 Nov 2021
Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control Myrsini Christidou Alexandra Vioni Nikolaos Ellinas G. Vamvoukakis K. Markopoulos Panos Kakoulidis June Sig Sung Hyoungmin Park Aimilios Chalamandaris Pirros Tsiakoulis 19 4 0 19 Nov 2021
DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 Yanqing Liu Rui Shao G. Wang Kuan Chen Bohan Li P. Yuen Jinzhu Li Lei He Sheng Zhao 32 55 0 25 Oct 2021
Discrete Acoustic Space for an Efficient Sampling in Neural Text-To-Speech Mu-Wei Li Jonas Rohnke A. Bonafonte Mateusz Lajszczak Trevor Wood DRL 17 2 0 24 Oct 2021
Improving Emotional Speech Synthesis by Using SUS-Constrained VAE and Text Encoder Aggregation Fengyu Yang Jian Luan Yujun Wang 30 5 0 19 Oct 2021
Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models Jen-Hao Rick Chang A. Shrivastava H. Koppula Xiaoshuai Zhang Oncel Tuzel DiffM 51 16 0 06 Oct 2021
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning Rui Li dong Pu Minnie Huang Bill Huang 50 14 0 23 Sep 2021
Multi-Scale Spectrogram Modelling for Neural Text-to-Speech Ammar Abbas Bajibabu Bollepalli Alexis Moinet Arnaud Joly Penny Karanasou Peter Makarov Simon Slangens S. Karlapati Thomas Drugman 16 0 0 29 Jun 2021
A Survey on Neural Speech Synthesis Xu Tan Tao Qin Frank Soong Tie-Yan Liu AI4TS 18 352 0 29 Jun 2021
FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis Taejun Bak Jaesung Bae Hanbin Bae Young-Ik Kim Hoon-Young Cho 20 16 0 29 Jun 2021
Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance Hieu-Thi Luong Junichi Yamagishi 17 0 0 25 Jun 2021
Improving Performance of Seen and Unseen Speech Style Transfer in End-to-end Neural TTS Xiaochun An Frank Soong Lei Xie 34 9 0 18 Jun 2021
Phone-Level Prosody Modelling with GMM-Based MDN for Diverse and Controllable Speech Synthesis Chenpeng Du K. Yu 17 18 0 27 May 2021
EAT: Enhanced ASR-TTS for Self-supervised Speech Recognition M. Baskar L. Burget Shinji Watanabe Ramón Fernández Astudillo J. Černocký 12 26 0 13 Apr 2021
AdaSpeech: Adaptive Text to Speech for Custom Voice Mingjian Chen Xu Tan Bohan Li Yanqing Liu Tao Qin Sheng Zhao Tie-Yan Liu VLM DiffM 23 187 0 01 Mar 2021
AISPEECH-SJTU accent identification system for the Accented English Speech Recognition Challenge Houjun Huang Xu Xiang Yexin Yang Rao Ma Y. Qian 11 25 0 19 Feb 2021
Rich Prosody Diversity Modelling with Phone-level Mixture Density Network Chenpeng Du K. Yu 28 17 0 01 Feb 2021
Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis Neeraj Kumar Srishti Goel Ankur Narang Brejesh Lall 19 5 0 14 Dec 2020
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis C. Chien Hung-yi Lee 19 36 0 12 Nov 2020
Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis Ron J. Weiss RJ Skerry-Ryan Eric Battenberg Soroosh Mariooryad Diederik P. Kingma 19 97 0 06 Nov 2020
Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis Guanghui Xu Wei Song Zhengchen Zhang Chao Zhang Xiaodong He Bowen Zhou 11 50 0 06 Nov 2020
AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines Yao Shi Hui Bu Xin Xu Shaojing Zhang Ming Li 16 218 0 22 Oct 2020