Title
Generalized Source Tracing: Detecting Novel Audio Deepfake Algorithm with Real Emphasis and Fake Dispersion Strategy Yuankun Xie Ruibo Fu Zhengqi Wen Zhiyong Wang Xiaopeng Wang Haonan Cheng Long Ye Jianhua Tao 34 2 0 05 Jun 2024
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec Shengpeng Ji Jia-li Zuo Minghui Fang Siqi Zheng Qian Chen ... Ziyue Jiang Hai Huang Xize Cheng Rongjie Huang Zhou Zhao 45 8 0 03 Jun 2024
The Codecfake Dataset and Countermeasures for the Universally Detection of Deepfake Audio Yuankun Xie Yi Lu Ruibo Fu Zhengqi Wen Zhiyong Wang ... Xiaopeng Wang Yukun Liu Haonan Cheng Long Ye Yi Sun 47 15 0 08 May 2024
SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound Haohe Liu Xuenan Xu Yiitan Yuan Mengyue Wu Wenwu Wang Mark D. Plumbley 27 18 0 30 Apr 2024
Fake it to make it: Using synthetic data to remedy the data shortage in joint multimodal speech-and-gesture synthesis Shivam Mehta Anna Deichler Jim O'Regan Birger Moëll Jonas Beskow G. Henter Simon Alexanderson 34 4 0 30 Apr 2024
USAT: A Universal Speaker-Adaptive Text-to-Speech Approach Wenbin Wang Yang Song Sanjay Jha 32 10 0 28 Apr 2024
Interactive tools for making temporally variable, multiple-attributes, and multiple-instances morphing accessible: Flexible manipulation of divergent speech instances for explorational research and education Hideki Kawahara Masanori Morise 26 1 0 20 Apr 2024
Neural Flow Diffusion Models: Learnable Forward Process for Improved Diffusion Modelling Grigory Bartosh Dmitry Vetrov C. A. Naesseth DiffM 29 7 0 19 Apr 2024
Gull: A Generative Multifunctional Audio Codec Yi Luo Jianwei Yu Hangting Chen Rongzhi Gu Chao Weng AuLLM 33 3 0 07 Apr 2024
CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech Jaehyeon Kim Keon Lee Seungjun Chung Jaewoong Cho 65 39 0 03 Apr 2024
EM-TTS: Efficiently Trained Low-Resource Mongolian Lightweight Text-to-Speech Ziqi Liang Haoxiang Shi Jiawei Wang Keda Lu 30 0 0 13 Mar 2024
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models Zeqian Ju Yuancheng Wang Kai Shen Xu Tan Detai Xin ... Shikun Zhang Jiang Bian Lei He Jinyu Li Sheng Zhao DiffM 33 143 0 05 Mar 2024
VoxGenesis: Unsupervised Discovery of Latent Speaker Manifold for Speech Synthesis Wei-wei Lin Chenhang He Man-Wai Mak Jiachen Lian Kong Aik Lee DiffM 39 0 0 01 Mar 2024
StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing Gaoxiang Cong Yuankai Qi Liang-Sheng Li Amin Beheshti Zhedong Zhang A. Hengel Ming-Hsuan Yang Chenggang Yan Qingming Huang 38 12 0 20 Feb 2024
Data-driven grapheme-to-phoneme representations for a lexicon-free text-to-speech Abhinav Garg Jiyeon Kim Sushil Khyalia Chanwoo Kim Dhananjaya N. Gowda 12 2 0 19 Jan 2024
SonicVisionLM: Playing Sound with Vision Language Models Zhifeng Xie Shengye Yu Qile He Mengtian Li VLM VGen 28 2 0 09 Jan 2024
ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations Cheng Gong Xin Wang Erica Cooper Dan Wells Longbiao Wang Jianwu Dang Korin Richmond Junichi Yamagishi 24 20 0 22 Dec 2023
MM-TTS: Multi-modal Prompt based Style Transfer for Expressive Text-to-Speech Synthesis Wenhao Guan Yishuang Li Tao Li Hukai Huang Feng Wang Jiayan Lin Lingyan Huang Lin Li Q. Hong 23 8 0 17 Dec 2023
HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis Sang-Hoon Lee Haram Choi Seung-Bin Kim Seong-Whan Lee BDL 25 31 0 21 Nov 2023
ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis Jungil Kong Junmo Lee Jeongmin Kim Beomjeong Kim Jihoon Park Dohee Kong Changheon Lee Sangjin Kim 21 1 0 20 Nov 2023
Controllable Generation of Artificial Speaker Embeddings through Discovery of Principal Directions Florian Lux Pascal Tilli Sarina Meyer Ngoc Thang Vu 15 2 0 26 Oct 2023
DASpeech: Directed Acyclic Transformer for Fast and High-quality Speech-to-Speech Translation Qingkai Fang Yan Zhou Yangzhou Feng 32 6 0 11 Oct 2023
Towards human-like spoken dialogue generation between AI agents from written dialogue Kentaro Mitsui Yukiya Hono Kei Sawada 29 13 0 02 Oct 2023
VoiceLDM: Text-to-Speech with Environmental Context Yeong-Won Lee In-won Yeon Juhan Nam Joon Son Chung VLM DiffM 16 10 0 24 Sep 2023
PromptTTS++: Controlling Speaker Identity in Prompt-Based Text-to-Speech Using Natural Language Descriptions Reo Shimizu Ryuichi Yamamoto Masaya Kawamura Yuma Shirahata Hironori Doi Tatsuya Komatsu Kentaro Tachibana DiffM 16 19 0 15 Sep 2023
An Efficient Temporary Deepfake Location Approach Based Embeddings for Partially Spoofed Audio Detection Yuankun Xie Haonan Cheng Yutian Wang Long Ye 27 6 0 06 Sep 2023
MuLanTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2023 Zhihang Xu Shaofei Zhang Xi Wang Jiajun Zhang Wenning Wei Lei He Sheng Zhao 16 2 0 06 Sep 2023
PromptTTS 2: Describing and Generating Voices with Text Prompt Yichong Leng Zhifang Guo Kai Shen Xu Tan Zeqian Ju ... Lei He Xiang-Yang Li Sheng Zhao Tao Qin Jiang Bian VLM DiffM 37 40 0 05 Sep 2023
Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training Shaohuan Zhou Xu Li Zhiyong Wu Yin Shan H. Meng 14 2 0 01 Sep 2023
The DeepZen Speech Synthesis System for Blizzard Challenge 2023 C. Veaux R. Maia Spyridoula Papendreou 16 1 0 30 Aug 2023
iSTFTNet2: Faster and More Lightweight iSTFT-Based Neural Vocoder Using 1D-2D CNN Takuhiro Kaneko Hirokazu Kameoka Kou Tanaka Shogo Seki 15 4 0 14 Aug 2023
AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining Haohe Liu Yiitan Yuan Xubo Liu Xinhao Mei Qiuqiang Kong Qiao Tian Yuping Wang Wenwu Wang Yuxuan Wang Mark D. Plumbley DiffM 22 220 0 10 Aug 2023
MusicLDM: Enhancing Novelty in Text-to-Music Generation Using Beat-Synchronous Mixup Strategies K. Chen Yusong Wu Haohe Liu Marianna Nezhurina Taylor Berg-Kirkpatrick Shlomo Dubnov DiffM 28 74 0 03 Aug 2023
SALTTS: Leveraging Self-Supervised Speech Representations for improved Text-to-Speech Synthesis Ramanan Sivaguru Vasista Sai Lodagala S. Umesh 14 2 0 02 Aug 2023
VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design Jungil Kong Jihoon Park Beomjeong Kim Jeongmin Kim Dohee Kong Sangjin Kim 11 35 0 31 Jul 2023
WavJourney: Compositional Audio Creation with Large Language Models Xubo Liu Zhongkai Zhu Haohe Liu Yiitan Yuan Meng Cui ... Jinhua Liang Yin Cao Qiuqiang Kong Mark D. Plumbley Wenwu Wang AuLLM 21 25 0 26 Jul 2023
ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading Yujia Xiao Shaofei Zhang Xi Wang Xuejiao Tan Lei He Sheng Zhao Frank Soong Tan Lee 17 5 0 03 Jul 2023
EmoSpeech: Guiding FastSpeech2 Towards Emotional Text to Speech Daria Diatlova V. Shutov 23 7 0 28 Jun 2023
Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale Matt Le Apoorv Vyas Bowen Shi Brian Karrer Leda Sari ... Mary Williamson Vimal Manohar Yossi Adi Jay Mahadeokar Wei-Ning Hsu AuLLM 28 264 0 23 Jun 2023
eCat: An End-to-End Model for Multi-Speaker TTS & Many-to-Many Fine-Grained Prosody Transfer Ammar Abbas S. Karlapati Bastian Schnell Penny Karanasou M. G. Moya Amith Nagaraj Ayman Boustati Nicole Peinelt Alexis Moinet Thomas Drugman 17 3 0 20 Jun 2023
Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis Shivam Mehta Siyang Wang Simon Alexanderson Jonas Beskow Éva Székely G. Henter DiffM 24 14 0 15 Jun 2023
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models Yinghao Aaron Li Cong Han Vinay S. Raghavan Gavin Mischler N. Mesgarani VLM DiffM 37 107 0 13 Jun 2023
HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models Ji-Sang Hwang Sang-Hoon Lee Seong-Whan Lee DiffM 25 8 0 12 Jun 2023
High-Fidelity Audio Compression with Improved RVQGAN Rithesh Kumar Prem Seetharaman Alejandro Luebs I. Kumar Kundan Kumar 33 282 0 11 Jun 2023
Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge Wenhao Guan Tao Li Yishuang Li Hukai Huang Q. Hong Lin Li DiffM 24 6 0 07 Jun 2023
An Overview on Generative AI at Scale with Edge-Cloud Computing Yun Cheng Wang Jintang Xue Chengwei Wei C.-C. Jay Kuo 24 30 0 02 Jun 2023
XPhoneBERT: A Pre-trained Multilingual Model for Phoneme Representations for Text-to-Speech L. T. Nguyen Thinh-Le-Gia Pham Dat Quoc Nguyen 12 13 0 31 May 2023
EE-TTS: Emphatic Expressive TTS with Linguistic Information Yifan Zhong Chen Zhang Xule Liu Chenxi Sun Weishan Deng Haifeng Hu Zhongqian Sun 13 3 0 20 May 2023
IMAD: IMage-Augmented multi-modal Dialogue Viktor Moskvoretskii Anton Frolov Denis Kuznetsov 17 3 0 17 May 2023
NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers Kai Shen Zeqian Ju Xu Tan Yanqing Liu Yichong Leng Lei He Tao Qin Sheng Zhao Jiang Bian DiffM 15 221 0 18 Apr 2023