Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

23 March 2018

Yuxuan Wang

Rif A. Saurous

Papers citing "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis"

50 / 275 papers shown

Spotlight-TTS: Spotlighting the Style via Voiced-Aware Style Extraction and Style Direction Adjustment for Expressive Text-to-Speech Nam-Gyu Kim Deok-Hyeon Cho Seung-Bin Kim Seong-Whan Lee 60 0 0 27 May 2025
GSA-TTS : Toward Zero-Shot Speech Synthesis based on Gradual Style Adaptor Seokgi Lee Jungjun Kim TTA 111 0 0 26 May 2025
Audio-to-Audio Emotion Conversion With Pitch And Duration Style Transfer Soumya Dutta Avni Jain Sriram Ganapathy 119 0 0 23 May 2025
On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Study Hyouin Liu Zhikuan Zhang 70 0 0 12 May 2025
ReverBERT: A State Space Model for Efficient Text-Driven Speech Style Transfer Michael Brown Sofia Martinez Priya Singh 72 0 0 26 Mar 2025
Serenade: A Singing Style Conversion Framework Based On Audio Infilling Lester Phillip Violeta Wen-Chin Huang Tomoki Toda 67 0 0 16 Mar 2025
A Unit-based System and Dataset for Expressive Direct Speech-to-Speech Translation Anna Min Chenxu Hu Yi Ren Hang Zhao 96 0 0 01 Feb 2025
VoicePrompter: Robust Zero-Shot Voice Conversion with Voice Prompt and Conditional Flow Matching Ha-Yeong Choi Jaehan Park 169 0 0 29 Jan 2025
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer Vladimir Bataev Subhankar Ghosh Vitaly Lavrukhin Jason Chun Lok Li AI4TS 118 1 0 10 Jan 2025
ZSVC: Zero-shot Style Voice Conversion with Disentangled Latent Diffusion Models and Adversarial Training Xinfa Zhu Lei He Yujia Xiao Xi Wang Xu Tan Sheng Zhao Lei Xie DiffM 102 2 0 08 Jan 2025
EmoSphere++: Emotion-Controllable Zero-Shot Text-to-Speech via Emotion-Adaptive Spherical Vector Deok-Hyeon Cho Hyung-Seok Oh Seung-Bin Kim Seong-Whan Lee 133 8 0 04 Nov 2024
The First VoicePrivacy Attacker Challenge Evaluation Plan N. Tomashenko Xiaoxiao Miao Emmanuel Vincent Junichi Yamagishi 257 3 0 09 Oct 2024
NTU-NPU System for Voice Privacy 2024 Challenge Nikita Kuzmin Hieu-Thi Luong Jixun Yao Lei Xie Kong Aik Lee Eng Siong Chng 108 1 0 03 Oct 2024
Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions Kun Zhou You Zhang Shengkui Zhao Hao Wang Zexu Pan ... Chongjia Ni Yukun Ma Trung Hieu Nguyen J. Yip Bin Ma 127 7 0 25 Sep 2024
Adapting General Disentanglement-Based Speaker Anonymization for Enhanced Emotion Preservation Xiaoxiao Miao Yuxiang Zhang Xin Wang N. Tomashenko D. Soh Ian Mcloughlin 116 2 0 12 Aug 2024
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models Chankyu Lee Rajarshi Roy Mengyao Xu Jonathan Raiman Mohammad Shoeybi Bryan Catanzaro Ming-Yu Liu RALM 308 205 0 27 May 2024
Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition Rendi Chevi Alham Fikri Aji 108 3 0 22 Feb 2024
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations Cheng Gong Xin Wang Erica Cooper Dan Wells Longbiao Wang Jianwu Dang Korin Richmond Junichi Yamagishi 116 25 0 22 Dec 2023
Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling Rui Liu Yifan Hu Yi Ren Xiang Yin Haizhou Li 97 19 0 19 Dec 2023
Learning Disentangled Speech Representations Yusuf Brima U. Krumnack Simone Pika Gunther Heidemann CoGe DRL 138 3 0 04 Nov 2023
Prosody Analysis of Audiobooks Charuta Pethe Yunting Yin Felix D Childress Yunting Yin Steven Skiena 89 1 0 10 Oct 2023
PromptVC: Flexible Stylistic Voice Conversion in Latent Space Driven by Natural Language Prompts Jixun Yao Yuguang Yang Yinjiao Lei Ziqian Ning Yanni Hu Yu Pan Jingjing Yin Hongbin Zhou Heng Lu Linfu Xie DiffM 115 23 0 17 Sep 2023
MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 Zhihang Xu Shaofei Zhang Xi Wang Jiajun Zhang Wenning Wei Lei He Sheng Zhao 81 2 0 06 Sep 2023
CALM: Contrastive Cross-modal Speaking Style Modeling for Expressive Text-to-Speech Synthesis Yi Meng Xiang Li Zhiyong Wu Tingtian Li Zixun Sun Xinyu Xiao Chi Sun Hui Zhan Helen Meng 62 1 0 30 Aug 2023
Real-time Detection of AI-Generated Speech for DeepFake Voice Conversion Jordan J. Bird Ahmad Lotfi 55 19 0 24 Aug 2023
DiffProsody: Diffusion-based Latent Prosody Generation for Expressive Speech Synthesis with Prosody Conditional Adversarial Training H. Oh Sang-Hoon Lee Seong-Whan Lee DiffM 102 16 0 31 Jul 2023
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias Ziyue Jiang Yi Ren Zhe Ye Jinglin Liu Chen Zhang ... Rongjie Huang Chunfeng Wang Xiang Yin Zejun Ma Zhou Zhao DiffM 105 80 0 06 Jun 2023
EmoMix: Emotion Mixing via Diffusion Models for Emotional Speech Synthesis Haobin Tang Xulong Zhang Jianzong Wang Ning Cheng Jing Xiao DiffM 106 27 0 01 Jun 2023
Controllable Speaking Styles Using a Large Language Model A. Sigurgeirsson Simon King 55 3 0 17 May 2023
Vocal Style Factorization for Effective Speaker Recognition in Affective Scenarios Morgan Sandler Arun Ross CVBM 61 0 0 13 May 2023
Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings Wei Xue Yiwen Wang Qi-fei Liu Yi-Ting Guo 73 1 0 09 May 2023
M2-CTTS: End-to-End Multi-scale Multi-modal Conversational Text-to-Speech Synthesis Jinlong Xue Yayue Deng Fengping Wang Ya Li Yingming Gao J. Tao Jianqing Sun Jiaen Liang 68 10 0 03 May 2023
Zero-shot text-to-speech synthesis conditioned using self-supervised speech representation model Kenichi Fujita Takanori Ashihara Hiroki Kanagawa Takafumi Moriya Yusuke Ijima 88 11 0 24 Apr 2023
Context-aware Coherent Speaking Style Prediction with Hierarchical Transformers for Audiobook Speech Synthesis Shunwei Lei Yixuan Zhou Liyang Chen Zhiyong Wu Shiyin Kang Helen Meng 84 6 0 13 Apr 2023
Improving Prosody for Cross-Speaker Style Transfer by Semi-Supervised Style Extractor and Hierarchical Modeling in Speech Synthesis Chunyu Qiang Peng Yang Hao Che Ying Zhang Xiaorui Wang Zhong-ming Wang 77 9 0 14 Mar 2023
Do Prosody Transfer Models Transfer Prosody? A. Sigurgeirsson Simon King DiffM 65 8 0 07 Mar 2023
FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model Rui Xue Yanqing Liu Lei He Xuejiao Tan Linquan Liu Ed Lin Sheng Zhao 118 7 0 06 Mar 2023
An investigation into the adaptability of a diffusion-based TTS model Haolin Chen Philip N. Garner DiffM 68 1 0 03 Mar 2023
Fine-grained Emotional Control of Text-To-Speech: Learning To Rank Inter- And Intra-Class Emotion Intensities Shijun Wang Jón Guðnason Damian Borth 83 10 0 02 Mar 2023
InstructTTS: Modelling Expressive TTS in Discrete Latent Space with Natural Language Style Prompt Dongchao Yang Songxiang Liu Rongjie Huang Chao Weng Helen Meng DiffM VLM 89 102 0 31 Jan 2023
A Comprehensive Review of Data-Driven Co-Speech Gesture Generation Simbarashe Nyatsanga Taras Kucherenko Chaitanya Ahuja G. Henter Michael Neff SLR 114 94 0 13 Jan 2023
UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion Hao Liu Tao Wang Ruibo Fu Jiangyan Yi Zhengqi Wen J. Tao 111 3 0 10 Jan 2023
Generative Emotional AI for Speech Emotion Recognition: The Case for Synthetic Emotional Speech Augmentation Abdullah Shahid S. Latif Junaid Qadir 64 23 0 10 Jan 2023
Emotion Selectable End-to-End Text-based Speech Editing Tao Wang Jiangyan Yi Ruibo Fu J. Tao Zhengqi Wen Chu Yuan Zhang 76 2 0 20 Dec 2022
Disentangling Prosody Representations with Unsupervised Speech Reconstruction Leyuan Qu Taiha Li C. Weber Theresa Pekarek-Rosin F. Ren S. Wermter 85 10 0 14 Dec 2022
Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis Chunyu Qiang Peng Yang Hao Che Xiaorui Wang Zhongyuan Wang BDL 71 6 0 13 Dec 2022
SNAC: Speaker-normalized affine coupling layer in flow-based architecture for zero-shot multi-speaker text-to-speech Byoung Jin Choi Myeonghun Jeong Joun Yeop Lee N. Kim 104 13 0 30 Nov 2022
Multi-Speaker Expressive Speech Synthesis via Multiple Factors Decoupling Xinfa Zhu Yinjiao Lei Kun Song Yongmao Zhang Tao Li Linfu Xie 75 17 0 19 Nov 2022
Robust Vocal Quality Feature Embeddings for Dysphonic Voice Detection Jianwei Zhang J. Liss Suren Jayasuriya Visar Berisha 66 8 0 17 Nov 2022
Improving Speech Emotion Recognition with Unsupervised Speaking Style Transfer Leyuan Qu Wei Wang C. Weber F. Ren Taiha Li S. Wermter 40 1 0 16 Nov 2022