ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2110.11499
  4. Cited By
Wav2CLIP: Learning Robust Audio Representations From CLIP

Wav2CLIP: Learning Robust Audio Representations From CLIP

21 October 2021
Ho-Hsiang Wu
Prem Seetharaman
Kundan Kumar
J. P. Bello
    CLIP
    VLM
ArXivPDFHTML

Papers citing "Wav2CLIP: Learning Robust Audio Representations From CLIP"

50 / 189 papers shown
Title
Audio Mamba: Bidirectional State Space Model for Audio Representation
  Learning
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Mehmet Hamza Erol
Arda Senocak
Jiu Feng
Joon Son Chung
Mamba
62
18
0
05 Jun 2024
Exploiting LMM-based knowledge for image classification tasks
Exploiting LMM-based knowledge for image classification tasks
Maria Tzelepi
Vasileios Mezaris
VLM
30
3
0
05 Jun 2024
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose
  Audio-Language Representation
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Masahiro Yasuda
Shunsuke Tsubaki
Keisuke Imoto
VLM
36
5
0
04 Jun 2024
Creative Text-to-Audio Generation via Synthesizer Programming
Creative Text-to-Audio Generation via Synthesizer Programming
Manuel Cherep
Nikhil Singh
Jessica Shand
23
3
0
01 Jun 2024
OmniBind: Teach to Build Unequal-Scale Modality Interaction for
  Omni-Bind of All
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
Yuanhuiyi Lyu
Xueye Zheng
Dahun Kim
Lin Wang
32
10
0
25 May 2024
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation
Shiqi Yang
Zhi-Wei Zhong
Mengjie Zhao
Shusuke Takahashi
Masato Ishii
Takashi Shibuya
Yuki Mitsufuji
43
2
0
23 May 2024
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation
Imagery as Inquiry: Exploring A Multimodal Dataset for Conversational Recommendation
Se-eun Yoon
Hyunsik Jeon
Julian McAuley
38
0
0
23 May 2024
Dance Any Beat: Blending Beats with Visuals in Dance Video Generation
Dance Any Beat: Blending Beats with Visuals in Dance Video Generation
Xuanchen Wang
Heng Wang
Dongnan Liu
Weidong Cai
30
3
0
15 May 2024
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
FreeBind: Free Lunch in Unified Multimodal Space via Knowledge Fusion
Zehan Wang
Ziang Zhang
Xize Cheng
Rongjie Huang
Luping Liu
...
Haifeng Huang
Yang Zhao
Tao Jin
Peng Gao
Zhou Zhao
23
8
0
08 May 2024
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
T-CLAP: Temporal-Enhanced Contrastive Language-Audio Pretraining
Yiitan Yuan
Zhuo Chen
Xubo Liu
Haohe Liu
Xuenan Xu
Dongya Jia
Yuanzhe Chen
Mark D. Plumbley
Wenwu Wang
CLIP
VLM
40
9
0
27 Apr 2024
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals
  and Accompaniment
Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment
Zhiqing Hong
Rongjie Huang
Xize Cheng
Yongqi Wang
Ruiqi Li
Fuming You
Zhou Zhao
Zhimeng Zhang
26
7
0
14 Apr 2024
T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
T-VSL: Text-Guided Visual Sound Source Localization in Mixtures
Tanvir Mahmud
Yapeng Tian
Diana Marculescu
42
8
0
02 Apr 2024
Heterogeneous Contrastive Learning for Foundation Models and Beyond
Heterogeneous Contrastive Learning for Foundation Models and Beyond
Lecheng Zheng
Baoyu Jing
Zihao Li
Hanghang Tong
Jingrui He
VLM
24
19
0
30 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment
Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale
Haosen Yang
Diptesh Kanojia
Jiangkang Deng
Xiatian Zhu
VOS
30
5
0
21 Mar 2024
N-Modal Contrastive Losses with Applications to Social Media Data in
  Trimodal Space
N-Modal Contrastive Losses with Applications to Social Media Data in Trimodal Space
William Theisen
Walter J. Scheirer
22
1
0
18 Mar 2024
Refining Knowledge Transfer on Audio-Image Temporal Agreement for
  Audio-Text Cross Retrieval
Refining Knowledge Transfer on Audio-Image Temporal Agreement for Audio-Text Cross Retrieval
Shunsuke Tsubaki
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
Noboru Harada
Keisuke Imoto
19
1
0
16 Mar 2024
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with
  Unsupervised Audio Mixtures
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures
Afrina Tabassum
Dung N. Tran
Trung D. Q. Dang
Ismini Lourentzou
K. Koishida
40
0
0
14 Mar 2024
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion
  Latent Aligners
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Yazhou Xing
Yin-Yin He
Zeyue Tian
Xintao Wang
Qifeng Chen
27
49
0
27 Feb 2024
ArcSin: Adaptive ranged cosine Similarity injected noise for
  Language-Driven Visual Tasks
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu
Xiaomin Yu
Gongyu Zhang
Christos Bergeles
Prokar Dasgupta
Alejandro Granados
Sebastien Ourselin
40
2
0
27 Feb 2024
M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced
  Video-grounded Dialogue Generation
M2K-VDG: Model-Adaptive Multimodal Knowledge Anchor Enhanced Video-grounded Dialogue Generation
Hongcheng Liu
Pingjie Wang
Yu Wang
Yanfeng Wang
33
1
0
19 Feb 2024
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model
  via Cross-modal Alignment
Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment
Angelos Zavras
Dimitrios Michail
Begum Demir
Ioannis Papoutsis
VLM
22
11
0
15 Feb 2024
Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced
  Auditory Experience
Listen, Chat, and Edit: Text-Guided Soundscape Modification for Enhanced Auditory Experience
Xilin Jiang
Cong Han
Yinghao Aaron Li
N. Mesgarani
KELM
26
4
0
06 Feb 2024
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model
  for Multimodal Processing
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
Xianghu Yue
Xiaohai Tian
Lu Lu
Malu Zhang
Zhizheng Wu
Haizhou Li
28
0
0
22 Jan 2024
Learning Audio Concepts from Counterfactual Natural Language
Learning Audio Concepts from Counterfactual Natural Language
A. Vosoughi
Luca Bondi
Ho-Hsiang Wu
Chenliang Xu
CML
45
2
0
10 Jan 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
30
6
0
08 Jan 2024
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
Wenxi Chen
Yuzhe Liang
Ziyang Ma
Zhisheng Zheng
Xie Chen
ViT
46
17
0
07 Jan 2024
Structural Information Guided Multimodal Pre-training for
  Vehicle-centric Perception
Structural Information Guided Multimodal Pre-training for Vehicle-centric Perception
Xiao Wang
Wentao Wu
Chenglong Li
Zhicheng Zhao
Zhe Chen
Yukai Shi
Jin Tang
38
4
0
15 Dec 2023
Can CLIP Help Sound Source Localization?
Can CLIP Help Sound Source Localization?
Sooyoung Park
Arda Senocak
Joon Son Chung
22
6
0
07 Nov 2023
FLAP: Fast Language-Audio Pre-training
FLAP: Fast Language-Audio Pre-training
Ching-Feng Yeh
Po-Yao Huang
Vasu Sharma
Shang-Wen Li
Gargi Ghosh
CLIP
VLM
28
8
0
02 Nov 2023
ATGNN: Audio Tagging Graph Neural Network
ATGNN: Audio Tagging Graph Neural Network
Shubhr Singh
Christian J. Steinmetz
Emmanouil Benetos
Huy P Phan
Dan Stowell
ViT
GNN
11
8
0
02 Nov 2023
Sound of Story: Multi-modal Storytelling with Audio
Sound of Story: Multi-modal Storytelling with Audio
Jaeyeon Bae
Seokhoon Jeong
Seokun Kang
Namgi Han
Jae-Yon Lee
Hyounghun Kim
Taehwan Kim
26
2
0
30 Oct 2023
On the Language Encoder of Contrastive Cross-modal Models
On the Language Encoder of Contrastive Cross-modal Models
Mengjie Zhao
Junya Ono
Zhi-Wei Zhong
Chieh-Hsin Lai
Yuhta Takida
Naoki Murata
Wei-Hsiang Liao
Takashi Shibuya
Hiromi Wakaki
Yuki Mitsufuji
VLM
28
0
0
20 Oct 2023
CLARA: Multilingual Contrastive Learning for Audio Representation
  Acquisition
CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition
K. A. Noriy
Xiaosong Yang
Marcin Budka
Jian Jun Zhang
VLM
21
3
0
18 Oct 2023
Extending Multi-modal Contrastive Representations
Extending Multi-modal Contrastive Representations
Zehan Wang
Ziang Zhang
Luping Liu
Yang Zhao
Haifeng Huang
Tao Jin
Zhou Zhao
19
5
0
13 Oct 2023
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language
  Models
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh
Ashish Seth
Sonal Kumar
Utkarsh Tyagi
Chandra Kiran Reddy Evuru
S. Ramaneswaran
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
VLM
CoGe
35
21
0
12 Oct 2023
LLark: A Multimodal Instruction-Following Language Model for Music
LLark: A Multimodal Instruction-Following Language Model for Music
Josh Gardner
Simon Durand
Daniel Stoller
Rachel M. Bittner
AuLLM
23
14
0
11 Oct 2023
MuseChat: A Conversational Music Recommendation System for Videos
MuseChat: A Conversational Music Recommendation System for Videos
Zhikang Dong
Bin Chen
Xiulong Liu
Paweł Polak
Peng Zhang
LRM
37
26
0
10 Oct 2023
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model
  Adaptation
Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
Guy Yariv
Itai Gat
Sagie Benaim
Lior Wolf
Idan Schwartz
Yossi Adi
DiffM
VGen
29
36
0
28 Sep 2023
Semantic Proximity Alignment: Towards Human Perception-consistent Audio
  Tagging by Aligning with Label Text Description
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
Youbin Jeon
Yanzhen Ren
VLM
24
0
0
28 Sep 2023
MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder
  Language Model for Video-grounded Dialogue Generation
MSG-BART: Multi-granularity Scene Graph-Enhanced Encoder-Decoder Language Model for Video-grounded Dialogue Generation
Hongcheng Liu
Zhe Chen
Hui Li
Pingjie Wang
Yanfeng Wang
Yu Wang
VGen
38
1
0
26 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
21
8
0
26 Sep 2023
Online Active Learning For Sound Event Detection
Online Active Learning For Sound Event Detection
Mark Lindsey
Ankit Shah
Francis Kubala
R. M. Stern
22
0
0
25 Sep 2023
A Large-scale Dataset for Audio-Language Representation Learning
A Large-scale Dataset for Audio-Language Representation Learning
Luoyi Sun
Xuenan Xu
Mengyue Wu
Weidi Xie
18
20
0
20 Sep 2023
Sound Source Localization is All about Cross-Modal Alignment
Sound Source Localization is All about Cross-Modal Alignment
Arda Senocak
H. Ryu
Junsik Kim
Tae-Hyun Oh
Hanspeter Pfister
Joon Son Chung
19
18
0
19 Sep 2023
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Subash Khanal
S. Sastry
A. Dhakal
Nathan Jacobs
31
8
0
19 Sep 2023
Discovering Sounding Objects by Audio Queries for Audio Visual
  Segmentation
Discovering Sounding Objects by Audio Queries for Audio Visual Segmentation
Shaofei Huang
Han Li
Yuqing Wang
Hongji Zhu
Jiao Dai
Jizhong Han
Wenge Rong
Si Liu
VOS
25
16
0
18 Sep 2023
MOSAIC: Learning Unified Multi-Sensory Object Property Representations
  for Robot Learning via Interactive Perception
MOSAIC: Learning Unified Multi-Sensory Object Property Representations for Robot Learning via Interactive Perception
Gyan Tatiya
Jonathan M Francis
Ho-Hsiang Wu
Yonatan Bisk
Jivko Sinapov
29
1
0
15 Sep 2023
Exploring Meta Information for Audio-based Zero-shot Bird Classification
Exploring Meta Information for Audio-based Zero-shot Bird Classification
Alexander Gebhard
Andreas Triantafyllopoulos
Teresa Bez
Lukas Christ
Alexander Kathan
Björn W. Schuller
6
6
0
15 Sep 2023
Audio-free Prompt Tuning for Language-Audio Models
Audio-free Prompt Tuning for Language-Audio Models
Yiming Li
Xiangdong Wang
Hong Liu
CLIP
VLM
14
9
0
15 Sep 2023
Natural Language Supervision for General-Purpose Audio Representations
Natural Language Supervision for General-Purpose Audio Representations
Benjamin Elizalde
Soham Deshmukh
Huaming Wang
AuLLM
AI4TS
19
53
0
11 Sep 2023
Previous
1234
Next