Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

23 March 2018

Yuxuan Wang

Rif A. Saurous

Papers citing "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

50 / 275 papers shown

Title
Enhancing Speaking Styles in Conversational Text-to-Speech Synthesis with Graph-based Multi-modal Context Modeling Jingbei Li Yi Meng Chenyi Li Zhiyong Wu Helen Meng Chao Weng Jane Polak Scowcroft 93 24 0 11 Jun 2021
Speech BERT Embedding For Improving Prosody in Neural TTS Liping Chen Yan Deng Xi Wang Frank Soong Lei He 89 23 0 08 Jun 2021
NWT: Towards natural audio-to-video generation with representation learning Rayhane Mama Marc S. Tyndel Hashiam Kadhim Cole Clifford Ragavan Thurairatnam VGen 112 12 0 08 Jun 2021
Data Augmentation Methods for End-to-end Speech Recognition on Distant-Talk Scenarios E. Tsunoo Kentarou Shibata Chaitanya Narisetty Yosuke Kashiwagi Shinji Watanabe 69 12 0 07 Jun 2021
Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation Dong Min Dong Bok Lee Eunho Yang Sung Ju Hwang 134 175 0 06 Jun 2021
Phone-Level Prosody Modelling with GMM-Based MDN for Diverse and Controllable Speech Synthesis Chenpeng Du K. Yu 154 20 0 27 May 2021
Learning Robust Latent Representations for Controllable Speech Synthesis Shakti Kumar Jithin Pradeep Hussain Zaidi DRL 68 6 0 10 May 2021
Exploring emotional prototypes in a high dimensional TTS latent space Pol van Rijn Silvan Mertes Dominik Schiller Peter M. C. Harrison P. Larrouy-Maestri Elisabeth André Nori Jacoby 59 12 0 05 May 2021
Review of end-to-end speech synthesis technology based on deep learning Zhaoxi Mu Xinyu Yang Yizhuo Dong AuLLM ALM 94 25 0 20 Apr 2021
Spectrogram Inpainting for Interactive Generation of Instrument Sounds Théis Bazin Gaëtan Hadjeres P. Esling M. Malt 61 11 0 15 Apr 2021
Towards end-to-end F0 voice conversion based on Dual-GAN with convolutional wavelet kernels Clément Le Moine Veillon Nicolas Obin Axel Roebel 35 8 0 15 Apr 2021
Enhancing Word-Level Semantic Representation via Dependency Structure for Expressive Text-to-Speech Synthesis Yixuan Zhou Changhe Song Jingbei Li Zhiyong Wu Yanyao Bian Jane Polak Scowcroft Helen Meng 103 6 0 14 Apr 2021
Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures Nick Rossenbach Mohammad Zeineldeen Benedikt Hilmes Ralf Schluter Hermann Ney 72 12 0 12 Apr 2021
Half-Truth: A Partially Fake Audio Detection Dataset Jiangyan Yi Ye Bai J. Tao Haoxin Ma Zhengkun Tian Chenglong Wang Tao Wang Ruibo Fu 71 85 0 08 Apr 2021
Towards Multi-Scale Style Control for Expressive Speech Synthesis Xiang Li Changhe Song Jingbei Li Zhiyong Wu Jia Jia Helen Meng 64 47 0 08 Apr 2021
Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability Rui Liu Berrak Sisman Haizhou Li 69 32 0 03 Apr 2021
Attention Forcing for Machine Translation Qingyun Dou Yiting Lu Potsawee Manakul Xixin Wu Mark Gales 60 7 0 02 Apr 2021
Assem-VC: Realistic Voice Conversion by Assembling Modern Speech Synthesis Techniques Kang-Wook Kim Seung-won Park Junhyeok Lee Myun-chul Joe 76 28 0 02 Apr 2021
PnG BERT: Augmented BERT on Phonemes and Graphemes for Neural TTS Ye Jia Heiga Zen Jonathan Shen Yu Zhang Yonghui Wu SSL 103 84 0 28 Mar 2021
STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech Keon Lee Kyumin Park Daeyoung Kim 69 32 0 17 Mar 2021
Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech C. Chien Jheng-hao Lin Chien-yu Huang Po-Chun Hsu Hung-yi Lee 119 70 0 06 Mar 2021
Adversarially learning disentangled speech representations for robust multi-factor voice conversion Jie Wang Jingbei Li Xintao Zhao Zhiyong Wu Shiyin Kang Helen Meng DRL 123 29 0 30 Jan 2021
Expressive Neural Voice Cloning Paarth Neekhara Shehzeen Samarah Hussain Shlomo Dubnov F. Koushanfar Julian McAuley DiffM 59 30 0 30 Jan 2021
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units Wei-Ning Hsu David Harwath Christopher Song James R. Glass CLIP 90 67 0 31 Dec 2020
The 2020 ESPnet update: new features, broadened applications, performance improvements, and future plans Shinji Watanabe Florian Boyer Xuankai Chang Pengcheng Guo Tomoki Hayashi ... Shigeki Karita Chenda Li Jing Shi Aswin Shanmugam Subramanian Wangyou Zhang VLM 108 38 0 23 Dec 2020
Incremental Text-to-Speech Synthesis Using Pseudo Lookahead with Large Pretrained Language Model Takaaki Saeki Shinnosuke Takamichi Hiroshi Saruwatari 55 16 0 23 Dec 2020
Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis Neeraj Kumar Srishti Goel Ankur Narang Brejesh Lall 68 5 0 14 Dec 2020
DeepTalk: Vocal Style Encoding for Speaker Recognition and Speech Synthesis Anurag Chowdhury Arun Ross Prabu David 28 5 0 09 Dec 2020
Using previous acoustic context to improve Text-to-Speech synthesis Pilar Oplustil Gallegos Simon King 70 11 0 07 Dec 2020
GraphPB: Graphical Representations of Prosody Boundary in Speech Synthesis Aolan Sun Jianzong Wang Ning Cheng Huayi Peng Zhen Zeng Lingwei Kong Jing Xiao 62 9 0 03 Dec 2020
Controllable Emotion Transfer For End-to-End Speech Synthesis Tao Li Shan Yang Liumeng Xue Lei Xie 79 74 0 17 Nov 2020
Fine-grained Emotion Strength Transfer, Control and Prediction for Emotional Speech Synthesis Yinjiao Lei Shan Yang Lei Xie 88 56 0 17 Nov 2020
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis C. Chien Hung-yi Lee 91 36 0 12 Nov 2020
Spoken Language Interaction with Robots: Research Issues and Recommendations, Report from the NSF Future Directions Workshop M. Marge C. Espy-Wilson Roger K. Moore 77 79 0 11 Nov 2020
Enhancing Low-Quality Voice Recordings Using Disentangled Channel Factor and Neural Waveform Model Haoyu Li Yang Ai Junichi Yamagishi 76 2 0 10 Nov 2020
Pretraining Strategies, Waveform Model Choice, and Acoustic Configurations for Multi-Speaker End-to-End Speech Synthesis Erica Cooper Xin Wang Yi Zhao Yusuke Yasuda Junichi Yamagishi SyDa 50 3 0 10 Nov 2020
Fine-grained Style Modeling, Transfer and Prediction in Text-to-Speech Synthesis via Phone-Level Content-Style Disentanglement Daxin Tan Tan Lee 116 21 0 08 Nov 2020
Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis Guanghui Xu Wei Song Zhengchen Zhang Chao Zhang Xiaodong He Bowen Zhou 62 50 0 06 Nov 2020
Data Augmentation for End-to-end Code-switching Speech Recognition Chenpeng Du Hao Li Yizhou Lu Lan Wang Y. Qian 57 28 0 04 Nov 2020
Speech Synthesis and Control Using Differentiable DSP Giorgio Fabbro Vladimir Golkov Thomas Kemp Zorah Lähner 78 12 0 28 Oct 2020
Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset Kun Zhou Berrak Sisman Rui Liu Haizhou Li 105 192 0 28 Oct 2020
Emotion controllable speech synthesis using emotion-unlabeled dataset with the assistance of cross-domain speech emotion recognition Xiong Cai Dongyang Dai Zhiyong Wu Xiang Li Jingbei Li Helen Meng 94 67 0 26 Oct 2020
Unsupervised Learning of Disentangled Speech Content and Style Representation Andros Tjandra Ruoming Pang Yu Zhang Shigeki Karita BDL DRL 73 15 0 24 Oct 2020
AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines Yao Shi Hui Bu Xin Xu Shaojing Zhang Ming Li 112 223 0 22 Oct 2020
Parallel Tacotron: Non-Autoregressive and Controllable TTS Isaac Elias Heiga Zen Jonathan Shen Yu Zhang Ye Jia Ron J. Weiss Yonghui Wu DRL 76 103 0 22 Oct 2020
Hierarchical Multi-Grained Generative Model for Expressive Speech Synthesis Yukiya Hono Kazuna Tsuboi Kei Sawada Kei Hashimoto Keiichiro Oura Yoshihiko Nankaku K. Tokuda BDL 57 24 0 17 Sep 2020
Controllable neural text-to-speech synthesis using intuitive prosodic features T. Raitio Ramya Rasipuram D. Castellani 78 66 0 14 Sep 2020
Modeling Prosodic Phrasing with Multi-Task Learning in Tacotron-based TTS Rui Liu Berrak Sisman F. Bao Guanglai Gao Haizhou Li 41 18 0 11 Aug 2020
Deep MOS Predictor for Synthetic Speech Using Cluster-Based Modeling Yeunju Choi Youngmoon Jung Hoirin Kim 139 26 0 09 Aug 2020
Expressive TTS Training with Frame and Style Reconstruction Loss Rui Liu Berrak Sisman Guanglai Gao Haizhou Li 112 73 0 04 Aug 2020