Title
Voice Cloning: Comprehensive Survey Hussam Azzuni Abdulmotaleb El Saddik VLM 114 0 0 01 May 2025
From Faces to Voices: Learning Hierarchical Representations for High-quality Video-to-Speech Ji-Hoon Kim Jeongsoo Choi Jaehun Kim Chaeyoung Jung Joon Son Chung CVBM 80 1 0 21 Mar 2025
Learning disentangled representations for instrument-based music similarity Yuka Hashizume Li Li Atsushi Miyashita Tomoki Toda 155 0 0 21 Mar 2025
Equivariant Blurring Diffusion for Hierarchical Molecular Conformer Generation Jiwoong Park Yang Shen DiffM 103 1 0 26 Oct 2024
DART: Disentanglement of Accent and Speaker Representation in Multispeaker Text-to-Speech J. Melechovský Ambuj Mehrish Berrak Sisman Dorien Herremans 57 2 0 17 Oct 2024
Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling Sotirios Karapiperis Nikolaos Ellinas Alexandra Vioni Junkwang Oh Gunu Jho Inchul Hwang S. Raptis 153 0 0 13 Sep 2024
Disentangling segmental and prosodic factors to non-native speech comprehensibility Waris Quamer Ricardo Gutierrez-Osuna 78 1 0 20 Aug 2024
Towards Expressive Zero-Shot Speech Synthesis with Hierarchical Prosody Modeling Yuepeng Jiang Tao Li Fengyu Yang Lei Xie Meng Meng Yujun Wang 73 2 0 09 Jun 2024
Textless Acoustic Model with Self-Supervised Distillation for Noise-Robust Expressive Speech-to-Speech Translation Min-Jae Hwang Ilia Kulikov Benjamin Peloquin Hongyu Gong Peng-Jen Chen Ann Lee 63 3 0 04 Jun 2024
Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training J. Melechovský Ambuj Mehrish Berrak Sisman Dorien Herremans 67 2 0 03 Jun 2024
Learning Multidimensional Disentangled Representations of Instrumental Sounds for Musical Similarity Assessment Yuka Hashizume Li Li Atsushi Miyashita Tomoki Toda 53 3 0 10 Apr 2024
Natural language guidance of high-fidelity text-to-speech with synthetic annotations Daniel Lyth Simon King 100 49 0 02 Feb 2024
ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering Ya-Zhen Song Zhuo Chen Xiaofei Wang Ziyang Ma Xie Chen AuLLM 113 42 0 14 Jan 2024
Audiobox: Unified Audio Generation with Natural Language Prompts Apoorv Vyas Bowen Shi Matt Le Andros Tjandra Yi-Chiao Wu ... Chris Summers Carleigh Wood Joshua Lane Mary Williamson Wei-Ning Hsu 133 94 0 25 Dec 2023
ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis Jungil Kong Junmo Lee Jeongmin Kim Beomjeong Kim Jihoon Park Dohee Kong Changheon Lee Sangjin Kim 94 1 0 20 Nov 2023
Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions Florian Lux Pascal Tilli Sarina Meyer Ngoc Thang Vu 49 1 0 26 Oct 2023
DPP-TTS: Diversifying prosodic features of speech via determinantal point processes Seongho Joo Hyukhun Koh Kyomin Jung DiffM 95 0 0 23 Oct 2023
Prosody Analysis of Audiobooks Charuta Pethe Felix D Childress Yunting Yin Steven Skiena 89 1 0 10 Oct 2023
U-Style: Cascading U-nets with Multi-level Speaker and Style Modeling for Zero-Shot Voice Cloning Tao Li Zhichao Wang Xinfa Zhu Jian Cong Qiao Tian Yuping Wang Lei Xie DiffM 77 4 0 06 Oct 2023
Cross-Utterance Conditioned VAE for Speech Generation Yongqian Li Cheng Yu Guangzhi Sun Weiqin Zu Zheng Tian ... Wei Pan Chao Zhang Jun Wang Yang Yang Fanglei Sun 66 2 0 08 Sep 2023
Self-Supervised Disentanglement of Harmonic and Rhythmic Features in Music Audio Signals Yiming Wu CoGe DRL 113 0 0 06 Sep 2023
MSM-VC: High-fidelity Source Style Transfer for Non-Parallel Voice Conversion by Multi-scale Style Modeling Zhichao Wang Xinsheng Wang Qicong Xie Tao Li Linfu Xie Qiao Tian Yuping Wang 114 4 0 03 Sep 2023
Rep2wav: Noise Robust text-to-speech Using self-supervised representations Qiu-shi Zhu Yunting Gu Rilin Chen Chao Weng Yuchen Hu Lirong Dai Jie Zhang AI4TS 81 3 0 28 Aug 2023
MSStyleTTS: Multi-Scale Style Modeling with Hierarchical Context Information for Expressive Speech Synthesis Shunwei Lei Yixuan Zhou Liyang Chen Zhiyong Wu Xixin Wu Shiyin Kang Helen Meng 87 7 0 29 Jul 2023
The Ethical Implications of Generative Audio Models: A Systematic Literature Review J. Barnett 86 32 0 07 Jul 2023
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale Matt Le Apoorv Vyas Bowen Shi Brian Karrer Leda Sari ... Mary Williamson Vimal Manohar Yossi Adi Jay Mahadeokar Wei-Ning Hsu AuLLM 132 306 0 23 Jun 2023
The Age of Synthetic Realities: Challenges and Opportunities J. P. Cardenuto Jing Yang Rafael Padilha Renjie Wan Daniel Moreira Haoliang Li Shiqi Wang Fernanda A. Andaló Sébastien Marcel Anderson de Rezende Rocha DeLMO 115 30 0 09 Jun 2023
StyleS2ST: Zero-shot Style Transfer for Direct Speech-to-speech Translation Kun Song Yi Ren Yinjiao Lei Chunfeng Wang Kun Wei Linfu Xie Xiang Yin Zejun Ma 82 9 0 28 May 2023
Controllable Speaking Styles Using a Large Language Model A. Sigurgeirsson Simon King 55 3 0 17 May 2023
Using Deepfake Technologies for Word Emphasis Detection Eran Kaufman Lee-Ad Gottlieb 59 0 0 12 May 2023
Learn to Sing by Listening: Building Controllable Virtual Singer by Unsupervised Learning from Voice Recordings Wei Xue Yiwen Wang Qi-fei Liu Yi-Ting Guo 73 1 0 09 May 2023
Scientists' Perspectives on the Potential for Generative AI in their Fields Meredith Ringel Morris AI4CE 71 43 0 04 Apr 2023
FoundationTTS: Text-to-Speech for ASR Customization with Generative Language Model Rui Xue Yanqing Liu Lei He Xuejiao Tan Linquan Liu Ed Lin Sheng Zhao 118 7 0 06 Mar 2023
Miipher: A Robust Speech Restoration Model Integrating Self-Supervised Speech and Text Representations Yuma Koizumi Heiga Zen Shigeki Karita Yifan Ding Kohei Yatabe Nobuyuki Morioka Yu Zhang Wei Han Ankur Bapna M. Bacchiani 94 29 0 03 Mar 2023
A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation Wen-Chin Huang Benjamin Peloquin Justine T. Kao Changhan Wang Hongyu Gong Elizabeth Salesky Yossi Adi Ann Lee Peng-Jen Chen 81 16 0 25 Jan 2023
Long-horizon video prediction using a dynamic latent hierarchy Alexey Zakharov Qinghai Guo Zafeirios Fountas 77 4 0 29 Dec 2022
Text-to-speech synthesis based on latent variable conversion using diffusion probabilistic model and variational autoencoder Yusuke Yasuda Tomoki Toda DiffM 79 8 0 16 Dec 2022
Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis Chunyu Qiang Peng Yang Hao Che Xiaorui Wang Zhongyuan Wang BDL 71 6 0 13 Dec 2022
Controllable speech synthesis by learning discrete phoneme-level prosodic representations Nikolaos Ellinas Myrsini Christidou Alexandra Vioni June Sig Sung Aimilios Chalamandaris Pirros Tsiakoulis P. Mastorocostas 66 7 0 29 Nov 2022
Evaluating and reducing the distance between synthetic and real speech distributions Christoph Minixhofer Ondřej Klejch P. Bell 82 8 0 29 Nov 2022
Disentangled Representation Learning Xin Eric Wang Hong Chen Siao Tang Zihao Wu Wenwu Zhu DRL 174 87 0 21 Nov 2022
Accented Text-to-Speech Synthesis with a Conditional Variational Autoencoder J. Melechovský Ambuj Mehrish Berrak Sisman Dorien Herremans 83 6 0 07 Nov 2022
Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis Karolos Nikitaras Konstantinos Klapsas Nikolaos Ellinas Georgia Maniati June Sig Sung Inchul Hwang S. Raptis Aimilios Chalamandaris Pirros Tsiakoulis 60 1 0 01 Nov 2022
Towards zero-shot Text-based voice editing using acoustic context conditioning, utterance embeddings, and reference encoders Jason Fong Yun Wang Prabhav Agrawal Vimal Manohar Jilong Wu Thilo Kohler Qing He 50 0 0 28 Oct 2022
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech Takaaki Saeki Heiga Zen Zhehuai Chen Nobuyuki Morioka Gary Wang Yu Zhang Ankur Bapna Andrew Rosenberg Bhuvana Ramabhadran 130 20 0 27 Oct 2022
Controllable Accented Text-to-Speech Synthesis Rui Liu Berrak Sisman Guanglai Gao Haizhou Li 79 6 0 22 Sep 2022
ZeroEGGS: Zero-shot Example-based Gesture Generation from Speech Saeed Ghorbani Ylva Ferstl Daniel Holden N. Troje M. Carbonneau 123 83 0 15 Sep 2022
The Role of Vocal Persona in Natural and Synthesized Speech Camille Noufi Lloyd May J. Berger 56 2 0 06 Sep 2022
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks L. Finkelstein Heiga Zen Norman Casagrande Chun-an Chan Ye Jia ... Jonathan Shen V. Wan Yu Zhang Yonghui Wu R. Clark 55 9 0 28 Aug 2022
Pathway to Future Symbiotic Creativity Yi-Ting Guo Qi-fei Liu Jie Chen Wei Xue Jie Fu ... Fernando Rosas Jeffrey Shaw Xing Wu Jiji Zhang Jianliang Xu 66 0 0 18 Aug 2022