arXiv: 2312.15821
Audiobox: Unified Audio Generation with Natural Language Prompts

25 December 2023
Apoorv Vyas
Bowen Shi
Matt Le
Andros Tjandra
Yi-Chiao Wu
Baishan Guo
Jiemin Zhang
Xinyue Zhang
Robert Adkins
W.K.F. Ngan
Jeff Wang
Ivan Cruz
Bapi Akula
A. Akinyemi
Brian Ellis
Rashel Moritz
Yael Yungster
Alice Rakotoarison
Liang Tan
Chris Summers
Carleigh Wood
Joshua Lane
Mary Williamson
Wei-Ning Hsu

Papers citing "Audiobox: Unified Audio Generation with Natural Language Prompts"

50 / 59 papers shown
OmniAudio: Generating Spatial Audio from 360-Degree Video
Huadai Liu
Tianyi Luo
Qikai Jiang
Kaicheng Luo
Peiwen Sun
...
X. Li
Shiliang Zhang
Zhijie Yan
Zhou Zhao
Wei Xue
VGen
51
0
0
21 Apr 2025
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
Tri Ton
Ji Woo Hong
Chang D. Yoo
VGen
24
0
0
08 Apr 2025
Measuring the Robustness of Audio Deepfake Detectors
Xiang Li
Pin-Yu Chen
Wenqi Wei
31
0
0
21 Mar 2025
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation
Sungwoo Cho
J. Choi
Sungnyun Kim
Se-Young Yun
54
0
0
14 Mar 2025
Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens
X. Wang
Mingqi Jiang
Z. Ma
Ziyu Zhang
S. Liu
...
Zhifei Li
Xie Chen
Lei Xie
Y. Guo
Wei Xue
73
10
0
03 Mar 2025
UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation
Alexander H. Liu
Sang-gil Lee
Chao-Han Huck Yang
Yuan Gong
Yu-Chun Wang
James Glass
Rafael Valle
Bryan Catanzaro
SSL
44
0
0
02 Mar 2025
TechSinger: Technique Controllable Multilingual Singing Voice Synthesis via Flow Matching
Wenxiang Guo
Yu Zhang
Changhao Pan
Rongjie Huang
Li Tang
Ruiqi Li
Zhiqing Hong
Yongqi Wang
Zhou Zhao
93
2
0
18 Feb 2025
Video Latent Flow Matching: Optimal Polynomial Projections for Video Interpolation and Extrapolation
Yang Cao
Zhao-quan Song
Chiwun Yang
VGen
44
2
0
01 Feb 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
79
2
0
10 Jan 2025
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Yi Yuan
Xubo Liu
Haohe Liu
Mark D. Plumbley
Wenwu Wang
52
3
0
10 Jan 2025
SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
Chenyu Yang
Shuai Wang
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Y. Xu
Yizhi Zhou
Haina Zhu
H. Li
KELM
103
1
0
18 Dec 2024
SyncFlow: Toward Temporally Aligned Joint Audio-Video Generation from Text
Haohe Liu
Gaël Le Lan
Xinhao Mei
Zhaoheng Ni
Anurag Kumar
Varun K. Nagaraja
Wenwu Wang
Mark D. Plumbley
Yangyang Shi
Vikas Chandra
VGen
61
1
0
03 Dec 2024
Wasserstein Flow Matching: Generative modeling over families of distributions
D. Haviv
Aram-Alexandre Pooladian
D. Pe’er
Brandon Amos
OOD
29
0
0
01 Nov 2024
A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation
Alexander H. Liu
Qirui Wang
Yuan Gong
James Glass
25
0
0
29 Oct 2024
Revisiting Reliability in Large-Scale Machine Learning Research Clusters
Apostolos Kokolis
Michael Kuchnik
John Hoffman
Adithya Kumar
Parth Malani
Faye Ma
Zachary DeVito
S.
Kalyan Saladi
Carole-Jean Wu
68
7
0
29 Oct 2024
Meta-Learning Approaches for Improving Detection of Unseen Speech Deepfakes
Ivan Kukanov
Janne Laakkonen
Tomi Kinnunen
Ville Hautamaki
AAML
24
0
0
27 Oct 2024
MusicFlow: Cascaded Flow Matching for Text Guided Music Generation
K R Prajwal
Bowen Shi
Matthew Lee
Apoorv Vyas
Andros Tjandra
...
Baishan Guo
Huiyu Wang
Triantafyllos Afouras
David Kant
Wei-Ning Hsu
30
5
0
27 Oct 2024
Enhancing Robustness in Deep Reinforcement Learning: A Lyapunov Exponent Approach
Rory Young
Nicolas Pugeault
AAML
57
3
0
14 Oct 2024
SRC-gAudio: Sampling-Rate-Controlled Audio Generation
Chenxing Li
Manjie Xu
Dong Yu
DiffM
26
0
0
09 Oct 2024
HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis
Yuto Nishimura
Takumi Hirose
Masanari Ohi
Hideki Nakayama
Nakamasa Inoue
VLM
29
1
0
06 Oct 2024
Did You Hear That? Introducing AADG: A Framework for Generating Benchmark Data in Audio Anomaly Detection
Ksheeraja Raghavan
Samiran Gode
Ankit Parag Shah
Surabhi Raghavan
Wolfram Burgard
Bhiksha Raj
Rita Singh
25
0
0
04 Oct 2024
Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control
Ryuichi Yamamoto
Yuma Shirahata
Masaya Kawamura
Kentaro Tachibana
DiffM
27
2
0
26 Sep 2024
Generative Speech Foundation Model Pretraining for High-Quality Speech Extraction and Restoration
Pin-Jui Ku
Alexander H. Liu
Roman Korostik
Sung-Feng Huang
Szu-Wei Fu
Ante Jukić
31
2
0
24 Sep 2024
Room Impulse Responses help attackers to evade Deep Fake Detection
Hieu-Thi Luong
Duc-Tuan Truong
Kong Aik Lee
Eng Siong Chng
30
1
0
23 Sep 2024
Rhythmic Foley: A Framework For Seamless Audio-Visual Alignment In Video-to-Audio Synthesis
Zhiqi Huang
Dan Luo
Jun Wang
Huan Liao
Zhiheng Li
Zhiyong Wu
VGen
45
4
0
13 Sep 2024
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos
Yan-Bo Lin
Yu Tian
L. Yang
Gedas Bertasius
Heng Wang
VGen
34
7
0
11 Sep 2024
SpeechCraft: A Fine-grained Expressive Speech Dataset with Natural Language Description
Zeyu Jin
Jia Jia
Qixin Wang
Kehan Li
Shuoyi Zhou
Songtao Zhou
Xiaoyu Qin
Zhiyong Wu
24
10
0
24 Aug 2024
Autoregressive Speech Synthesis without Vector Quantization
Lingwei Meng
Long Zhou
Shujie Liu
Sanyuan Chen
Bing Han
...
Jinyu Li
Sheng Zhao
Xixin Wu
Helen Meng
Furu Wei
38
30
0
11 Jul 2024
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of Audio Events in Text-to-audio Generation
Zeyu Xie
Xuenan Xu
Zhizheng Wu
Mengyue Wu
27
7
0
03 Jul 2024
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
Hao Yang
Lizhen Qu
Ehsan Shareghi
Gholamreza Haffari
28
0
0
25 Jun 2024
Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation
Or Tal
Alon Ziv
Itai Gat
Felix Kreuk
Yossi Adi
37
13
0
16 Jun 2024
Articulatory Phonetics Informed Controllable Expressive Speech Synthesis
Zehua Kcriss Li
Meiying Melissa Chen
Yi Zhong
Pinxin Liu
Zhiyao Duan
26
0
0
15 Jun 2024
LAFMA: A Latent Flow Matching Model for Text-to-Audio Generation
Wenhao Guan
K. Wang
Wangjin Zhou
Yang Wang
Feng Deng
Hui Wang
Lin Li
Q. Hong
Yong Qin
DiffM
28
3
0
12 Jun 2024
FakeSound: Deepfake General Audio Detection
Zeyu Xie
Baihan Li
Xuenan Xu
Zheng Liang
Kai Yu
Mengyue Wu
21
1
0
12 Jun 2024
LibriTTS-P: A Corpus with Speaking Style and Speaker Identity Prompts for Text-to-Speech and Style Captioning
Masaya Kawamura
Ryuichi Yamamoto
Yuma Shirahata
Takuya Hasumi
Kentaro Tachibana
VLM
22
5
0
12 Jun 2024
An Investigation of Noise Robustness for Flow-Matching-Based Zero-Shot TTS
Xiaofei Wang
Sefik Emre Eskimez
Manthan Thakker
Hemin Yang
Zirun Zhu
...
Yufei Xia
Jinzhu Li
Sheng Zhao
Jinyu Li
Naoyuki Kanda
27
3
0
09 Jun 2024
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers
Sanyuan Chen
Shujie Liu
Long Zhou
Yanqing Liu
Xu Tan
Jinyu Li
Sheng Zhao
Yao Qian
Furu Wei
VLM
29
64
0
08 Jun 2024
Self-Supervised Singing Voice Pre-Training towards Speech-to-Singing Conversion
Ruiqi Li
Rongjie Huang
Yongqi Wang
Zhiqing Hong
Zhou Zhao
29
1
0
04 Jun 2024
ControlSpeech: Towards Simultaneous Zero-shot Speaker Cloning and Zero-shot Language Style Control With Decoupled Codec
Shengpeng Ji
Jia-li Zuo
Minghui Fang
Siqi Zheng
Qian Chen
...
Ziyue Jiang
Hai Huang
Xize Cheng
Rongjie Huang
Zhou Zhao
45
7
0
03 Jun 2024
Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching
Yongqi Wang
Wenxiang Guo
Rongjie Huang
Jia-Bin Huang
Zehan Wang
Fuming You
Ruiqi Li
Zhou Zhao
VGen
DiffM
26
11
0
01 Jun 2024
Multi-speaker Text-to-speech Training with Speaker Anonymized Data
Wen-Chin Huang
Yi-Chiao Wu
T. Toda
32
1
0
20 May 2024
Prompt-guided Precise Audio Editing with Diffusion Models
Manjie Xu
Chenxing Li
Duzhen Zhang
Dan Su
Weihan Liang
Dong Yu
DiffM
31
4
0
11 May 2024
FlashSpeech: Efficient Zero-Shot Speech Synthesis
Zhen Ye
Zeqian Ju
Haohe Liu
Xu Tan
Jianyi Chen
...
Weizhen Bian
Shulin He
Qi-fei Liu
Yi-Ting Guo
Wei Xue
35
16
0
23 Apr 2024
Voice Attribute Editing with Text Prompt
Zheng-Yan Sheng
Yang Ai
Li-Juan Liu
Jia Pan
Zhenhua Ling
26
4
0
13 Apr 2024
Synthetic training set generation using text-to-audio models for environmental sound classification
Francesca Ronchini
Luca Comanducci
Fabio Antonacci
28
2
0
26 Mar 2024
HAM-TTS: Hierarchical Acoustic Modeling for Token-Based Zero-Shot Text-to-Speech with Model and Data Scaling
Chunhui Wang
Chang Zeng
Bowen Zhang
Ziyang Ma
Yefan Zhu
Zifeng Cai
Jian Zhao
Zhonglin Jiang
Yong Chen
SyDa
36
5
0
09 Mar 2024
Making Flow-Matching-Based Zero-Shot Text-to-Speech Laugh as You Like
Naoyuki Kanda
Xiaofei Wang
Sefik Emre Eskimez
Manthan Thakker
Hemin Yang
...
Yufei Xia
Jinzhu Li
Yanqing Liu
Sheng Zhao
Michael Zeng
19
8
0
12 Feb 2024
Fast Timing-Conditioned Latent Audio Diffusion
Zach Evans
CJ Carr
Josiah Taylor
Scott H. Hawley
Jordi Pons
DiffM
74
98
0
07 Feb 2024
Natural language guidance of high-fidelity text-to-speech with synthetic annotations
Daniel Lyth
Simon King
16
35
0
02 Feb 2024
Proactive Detection of Voice Cloning with Localized Watermarking
Robin San Roman
Pierre Fernandez
Alexandre Défossez
Teddy Furon
Tuan Tran
Hady ElSahar
35
39
0
30 Jan 2024