ResearchTrend.AI
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

12 November 2022
Yusong Wu
K. Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
    CLIP

Papers citing "Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation"

50 / 342 papers shown
TACOS: Temporally-aligned Audio CaptiOnS for Language-Audio Pretraining
Paul Primus
Florian Schmid
Gerhard Widmer
CLIP
AI4TS
VLM
18
0
0
12 May 2025
Emotion-Qwen: Training Hybrid Experts for Unified Emotion and General Vision-Language Understanding
Dawei Huang
Qing Li
Chuan Yan
Zebang Cheng
Y. Huang
Xiang Li
B. Li
X. U. Wang
Z. Lian
Xiaojiang Peng
24
0
0
10 May 2025
FLAM: Frame-Wise Language-Audio Modeling
Yusong Wu
Christos Tsirigotis
Ke Chen
Cheng-Zhi Anna Huang
Aaron C. Courville
Oriol Nieto
Prem Seetharaman
Justin Salamon
43
0
0
08 May 2025
Hearing and Seeing Through CLIP: A Framework for Self-Supervised Sound Source Localization
Sooyoung Park
Arda Senocak
Joon Son Chung
VLM
43
0
0
08 May 2025
Versatile Framework for Song Generation with Prompt-based Control
Y. Zhang
Wenxiang Guo
Changhao Pan
Z. Zhu
Ruiqi Li
...
Rongjie Huang
Ruiyuan Zhang
Zhiqing Hong
Ziyue Jiang
Zhou Zhao
71
1
0
27 Apr 2025
DOSE : Drum One-Shot Extraction from Music Mixture
Suntae Hwang
Seonghyeon Kang
Kyungsu Kim
Semin Ahn
K. Lee
36
0
0
25 Apr 2025
Unleashing the Power of Natural Audio Featuring Multiple Sound Sources
Xize Cheng
Slytherin Wang
Zehan Wang
Rongjie Huang
Tao Jin
Zhou Zhao
42
0
0
24 Apr 2025
Exploring the User Experience of AI-Assisted Sound Searching Systems for Creative Workflows
Haohe Liu
Thomas Deacon
Wenwu Wang
Matt Paradis
Mark D. Plumbley
22
0
0
22 Apr 2025
DRAGON: Distributional Rewards Optimize Diffusion Generative Models
Yatong Bai
Jonah Casebeer
Somayeh Sojoudi
Nicholas J. Bryan
DiffM
VLM
39
1
0
21 Apr 2025
Transformation of audio embeddings into interpretable, concept-based representations
Alice Zhang
Edison Thomaz
Lie Lu
27
0
0
18 Apr 2025
A Survey on Cross-Modal Interaction Between Music and Multimodal Data
Sifei Li
Mining Tan
Feier Shen
Minyan Luo
Zijiao Yin
Fan Tang
W. Dong
Changsheng Xu
57
0
0
17 Apr 2025
FACT: Foundation Model for Assessing Cancer Tissue Margins with Mass Spectrometry
Mohammad Farahmand
A. Jamzad
Fahimeh Fooladgar
Laura Connolly
Martin Kaufmann
Kevin Yi Mi Ren
John Rudan
Doug McKay
Gabor Fichtinger
P. Mousavi
31
0
0
15 Apr 2025
TARO: Timestep-Adaptive Representation Alignment with Onset-Aware Conditioning for Synchronized Video-to-Audio Synthesis
Tri Ton
Ji Woo Hong
Chang D. Yoo
VGen
24
0
0
08 Apr 2025
LoopGen: Training-Free Loopable Music Generation
Davide Marincione
Giorgio Strano
Donato Crisostomi
Roberto Ribuoli
Emanuele Rodolà
MGen
48
0
0
06 Apr 2025
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
Shivam Mehta
Nebojsa Jojic
Hannes Gamper
31
0
0
28 Mar 2025
Enhance Generation Quality of Flow Matching V2A Model via Multi-Step CoT-Like Guidance and Combined Preference Optimization
Haomin Zhang
S.
Haoyu Wang
Zihao Chen
X. Liu
Chaofan Ding
Xinhan Di
31
0
0
28 Mar 2025
GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations
Yupei Li
Qiyang Sun
Sunil Munthumoduku Krishna Murthy
Emran Alturki
Björn Schuller
57
0
0
26 Mar 2025
Aligning Text-to-Music Evaluation with Human Preferences
Yichen Huang
Zachary Novack
Koichi Saito
Jiatong Shi
Shinji Watanabe
Yuki Mitsufuji
John Thickstun
Chris Donahue
EGVM
65
1
0
20 Mar 2025
DiffGAP: A Lightweight Diffusion Module in Contrastive Space for Bridging Cross-Model Gap
Shentong Mo
Zehua Chen
Fan Bao
Jun-Jie Zhu
DiffM
50
0
0
15 Mar 2025
AudioX: Diffusion Transformer for Anything-to-Audio Generation
Zeyue Tian
Yizhu Jin
Zhaoyang Liu
Ruibin Yuan
Xu Tan
Qifeng Chen
Wei Xue
Y. Guo
65
3
0
13 Mar 2025
M2R-Whisper: Multi-stage and Multi-scale Retrieval Augmentation for Enhancing Whisper
Jiaming Zhou
S. Zhao
Jiabei He
Hui Wang
Wenjia Zeng
Yong Chen
Haoqin Sun
Aobo Kong
Yong Qin
55
1
0
13 Mar 2025
MACS: Multi-source Audio-to-image Generation with Contextual Significance and Semantic Alignment
Hao Zhou
Xiaobao Guo
Yuzhe Zhu
A. Kong
DiffM
46
1
0
13 Mar 2025
TA-V2A: Textually Assisted Video-to-Audio Generation
Yuhuan You
Xihong Wu
T. Qu
DiffM
45
0
0
12 Mar 2025
A Survey on Knowledge-Oriented Retrieval-Augmented Generation
Mingyue Cheng
Yucong Luo
Jie Ouyang
Q. Liu
Huijie Liu
...
Bohou Zhang
Jiawei Cao
Jie Ma
Daoyu Wang
Enhong Chen
3DV
64
3
0
11 Mar 2025
Mellow: a small audio language model for reasoning
Soham Deshmukh
Satvik Dixit
Rita Singh
Bhiksha Raj
AuLLM
ReLM
LRM
75
1
0
11 Mar 2025
MM-StoryAgent: Immersive Narrated Storybook Video Generation with a Multi-Agent Paradigm across Text, Image and Audio
Xuenan Xu
Jiahao Mei
Chenliang Li
Yuning Wu
M. Yan
Shaopeng Lai
J. Zhang
Mengyue Wu
VGen
LLMAG
44
1
0
07 Mar 2025
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities
Sreyan Ghosh
Zhifeng Kong
Sonal Kumar
S. Sakshi
Jaehyeon Kim
Wei Ping
Rafael Valle
Dinesh Manocha
Bryan Catanzaro
MLLM
AuLLM
LRM
49
5
0
06 Mar 2025
GuardDoor: Safeguarding Against Malicious Diffusion Editing via Protective Backdoors
Yaopei Zeng
Yuanpu Cao
Lu Lin
DiffM
WIGM
66
0
0
05 Mar 2025
Language Model Mapping in Multimodal Music Learning: A Grand Challenge Proposal
Daniel Y. Chin
Gus Xia
34
0
0
01 Mar 2025
DGFM: Full Body Dance Generation Driven by Music Foundation Models
Xinran Liu
Zhenhua Feng
Diptesh Kanojia
Wenwu Wang
DiffM
62
1
0
27 Feb 2025
GCDance: Genre-Controlled 3D Full Body Dance Generation Driven By Music
Xinran Liu
Xu Dong
Diptesh Kanojia
Wenwu Wang
Zhenhua Feng
DiffM
60
0
0
25 Feb 2025
KAD: No More FAD! An Effective and Efficient Evaluation Metric for Audio Generation
Yoonjin Chung
Pilsun Eu
Junwon Lee
Keunwoo Choi
Juhan Nam
Ben Sangbae Chon
EGVM
57
3
0
21 Feb 2025
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation
Z. Liu
Shuangrui Ding
Zhixiong Zhang
Xiaoyi Dong
Pan Zhang
Yuhang Zang
Y. Cao
D. Lin
Jiaqi Wang
74
0
0
18 Feb 2025
TokenSynth: A Token-based Neural Synthesizer for Instrument Cloning and Text-to-Instrument
Kyungsu Kim
Junghyun Koo
Sungho Lee
Haesun Joung
Kyogu Lee
45
0
0
13 Feb 2025
Unbiased Sliced Wasserstein Kernels for High-Quality Audio Captioning
Manh Luong
Khai Nguyen
Dinh Q. Phung
Gholamreza Haffari
Lizhen Qu
47
0
0
08 Feb 2025
Audio-Language Models for Audio-Centric Tasks: A survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
AuLLM
99
2
0
28 Jan 2025
Hybrid Losses for Hierarchical Embedding Learning
Haokun Tian
Stefan Lattner
Brian McFee
Charalampos Saitis
43
0
0
22 Jan 2025
AudioBERT: Audio Knowledge Augmented Language Model
Hyunjong Ok
Suho Yoo
Jaeho Lee
AuLLM
RALM
VLM
42
0
0
17 Jan 2025
COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations
Ruben Ciranni
Emilian Postolache
Giorgio Mariani
Michele Mancusi
Giorgio Fabbro
Emanuele Rodolà
Luca Cosmo
59
7
0
10 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
104
107
0
10 Jan 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
79
2
0
10 Jan 2025
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Yi Yuan
Xubo Liu
Haohe Liu
Mark D. Plumbley
Wenwu Wang
52
3
0
10 Jan 2025
Sound-VECaps: Improving Audio Generation with Visual Enhanced Captions
Yi Yuan
Dongya Jia
Xiaobin Zhuang
Yuanzhe Chen
Zhengxi Liu
...
Y. Wang
Xubo Liu
Xiyuan Kang
Mark D. Plumbley
Wenwu Wang
VLM
48
4
0
03 Jan 2025
Text2midi: Generating Symbolic Music from Captions
Keshav Bhandari
Abhinaba Roy
Kyra Wang
Geeta Puri
Simon Colton
Dorien Herremans
69
1
0
03 Jan 2025
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow Matching and Clap-Ranked Preference Optimization
Chia-Yu Hung
Navonil Majumder
Zhifeng Kong
Ambuj Mehrish
Rafael Valle
Bryan Catanzaro
Soujanya Poria
52
5
0
30 Dec 2024
Smooth-Foley: Creating Continuous Sound for Video-to-Audio Generation Under Semantic Guidance
Yaoyun Zhang
Xuenan Xu
Mengyue Wu
VGen
26
0
0
24 Dec 2024
Multiple Consistency-guided Test-Time Adaptation for Contrastive Audio-Language Models with Unlabeled Audio
Gongyu Chen
Haomin Zhang
Chaofan Ding
Zihao Chen
Xinhan Di
35
0
0
23 Dec 2024
MMAudio: Taming Multimodal Joint Training for High-Quality Video-to-Audio Synthesis
Ho Kei Cheng
Masato Ishii
Akio Hayakawa
Takashi Shibuya
A. Schwing
Yuki Mitsufuji
VGen
123
12
0
19 Dec 2024
SongEditor: Adapting Zero-Shot Song Generation Language Model as a Multi-Task Editor
Chenyu Yang
Shuai Wang
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Y. Xu
Yizhi Zhou
Haina Zhu
H. Li
KELM
97
1
0
18 Dec 2024
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
Pengcheng Zhao
Jinxing Zhou
Yang Zhao
D. Guo
Yanxiang Chen
80
1
0
15 Dec 2024