2106.13043
AudioCLIP: Extending CLIP to Image, Text and Audio
24 June 2021
A. Guzhov
Federico Raue
Jörn Hees
Andreas Dengel
CLIP
VLM
Papers citing
"AudioCLIP: Extending CLIP to Image, Text and Audio"
46 / 46 papers shown
Title
OpenAVS: Training-Free Open-Vocabulary Audio Visual Segmentation with Foundational Models
Shengkai Chen
Yifang Yin
Jinming Cao
Shili Xiang
Zhenguang Liu
Roger Zimmermann
VOS
VLM
37
0
0
30 Apr 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
74
0
0
30 Apr 2025
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
Ruoxuan Feng
Jiangyu Hu
Wenke Xia
Tianci Gao
Ao Shen
Yuhao Sun
Bin Fang
Di Hu
42
2
0
15 Feb 2025
Audio-Language Datasets of Scenes and Events: A Survey
Gijs Wijngaard
Elia Formisano
Michele Esposito
M. Dumontier
74
2
0
10 Jan 2025
OneLLM: One Framework to Align All Modalities with Language
Jiaming Han
Kaixiong Gong
Yiyuan Zhang
Jiaqi Wang
Kaipeng Zhang
D. Lin
Yu Qiao
Peng Gao
Xiangyu Yue
MLLM
104
102
0
10 Jan 2025
Adversarial Hubness in Multi-Modal Retrieval
Tingwei Zhang
Fnu Suya
Rishi Jha
Collin Zhang
Vitaly Shmatikov
AAML
81
1
0
18 Dec 2024
Expanding Event Modality Applications through a Robust CLIP-Based Encoder
SungHeon Jeong
Hanning Chen
Sanggeon Yun
Suhyeon Cho
Wenjun Huang
Xiangjian Liu
Mohsen Imani
98
1
0
04 Dec 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
37
0
0
18 Nov 2024
Past, Present, and Future of Sensor-Based Human Activity Recognition Using Wearables: A Surveying Tutorial on a Still Challenging Task
H. Haresamudram
Chi Ian Tang
Sungho Suh
P. Lukowicz
Thomas Ploetz
74
2
0
11 Nov 2024
A Comprehensive Survey with Critical Analysis for Deepfake Speech Detection
Lam Pham
Phat Lam
Dat Tran
Hieu Tang
Tin Nguyen
Alexander Schindler
Canh Vu
Alexander Polonsky
44
3
0
23 Sep 2024
From Latent to Engine Manifolds: Analyzing ImageBind's Multimodal Embedding Space
Andrew Hamara
Pablo Rivas
16
1
0
30 Aug 2024
D&M: Enriching E-commerce Videos with Sound Effects by Key Moment Detection and SFX Matching
Jingyu Liu
Minquan Wang
Ye Ma
Bo Wang
Aozhu Chen
Quan Chen
Peng Jiang
Xirong Li
36
1
0
23 Aug 2024
Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval
Zeyu Chen
Pengfei Zhang
Kai Ye
Wei Dong
Xin Feng
Yana Zhang
33
0
0
28 Jul 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serra
26
2
0
08 Jul 2024
Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan
Heinrich Dinkel
Yongqing Wang
Jizhong Liu
Junbo Zhang
Yujun Wang
Bin Wang
VLM
27
4
0
11 Jun 2024
OmniBind: Teach to Build Unequal-Scale Modality Interaction for Omni-Bind of All
Yuanhuiyi Lyu
Xueye Zheng
Dahun Kim
Lin Wang
32
10
0
25 May 2024
Heterogeneous Contrastive Learning for Foundation Models and Beyond
Lecheng Zheng
Baoyu Jing
Zihao Li
Hanghang Tong
Jingrui He
VLM
24
18
0
30 Mar 2024
Extending Multi-modal Contrastive Representations
Zehan Wang
Ziang Zhang
Luping Liu
Yang Zhao
Haifeng Huang
Tao Jin
Zhou Zhao
19
5
0
13 Oct 2023
MuseChat: A Conversational Music Recommendation System for Videos
Zhikang Dong
Bin Chen
Xiulong Liu
Paweł Polak
Peng Zhang
LRM
24
25
0
10 Oct 2023
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
Youbin Jeon
Yanzhen Ren
VLM
17
0
0
28 Sep 2023
CLIP-Hand3D: Exploiting 3D Hand Pose Estimation via Context-Aware Prompting
Shaoxiang Guo
Qing Cai
Lin Qi
Junyu Dong
3DH
28
7
0
28 Sep 2023
CWCL: Cross-Modal Transfer with Continuously Weighted Contrastive Loss
R. S. Srinivasa
Jaejin Cho
Chouchang Yang
Yashas Malur Saidutta
Ching Hua Lee
Yilin Shen
Hongxia Jin
VLM
16
8
0
26 Sep 2023
Joint Audio and Speech Understanding
Yuan Gong
Alexander H. Liu
Hongyin Luo
Leonid Karlinsky
James R. Glass
AuLLM
13
65
0
25 Sep 2023
Weakly-supervised Automated Audio Captioning via text only training
Theodoros Kouzelis
V. Katsouros
CLIP
25
6
0
21 Sep 2023
Learning Tri-modal Embeddings for Zero-Shot Soundscape Mapping
Subash Khanal
S. Sastry
A. Dhakal
Nathan Jacobs
23
8
0
19 Sep 2023
ImageBind-LLM: Multi-modality Instruction Tuning
Jiaming Han
Renrui Zhang
Wenqi Shao
Peng Gao
Peng-Tao Xu
...
Yafei Wen
Xiaoxin Chen
Xiangyu Yue
Hongsheng Li
Yu Qiao
MLLM
19
115
0
07 Sep 2023
Cross-Modal Retrieval Meets Inference: Improving Zero-Shot Classification with Cross-Modal Retrieval
Seong-Hoon Eom
Namgyu Ho
Jaehoon Oh
Se-Young Yun
CLIP
VLM
18
0
0
29 Aug 2023
Adversarial Illusions in Multi-Modal Embeddings
Tingwei Zhang
Rishi Jha
Eugene Bagdasaryan
Vitaly Shmatikov
AAML
19
8
0
22 Aug 2023
UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
Sen Fang
Bowen Gao
Yangjian Wu
T. Teoh
DiffM
11
1
0
29 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
13
116
0
25 Jul 2023
Adapting Language-Audio Models as Few-Shot Audio Learners
Jinhua Liang
Xubo Liu
Haohe Liu
Huy P Phan
Emmanouil Benetos
Mark D. Plumbley
Wenwu Wang
VLM
17
19
0
28 May 2023
Pengi: An Audio Language Model for Audio Tasks
Soham Deshmukh
Benjamin Elizalde
Rita Singh
Huaming Wang
MLLM
AuLLM
25
155
0
19 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
13
113
0
18 May 2023
Soundini: Sound-Guided Diffusion for Natural Video Editing
Seung Hyun Lee
Si-Yeol Kim
Innfarn Yoo
Feng Yang
Donghyeon Cho
Youngseo Kim
Huiwen Chang
Jinkyu Kim
Sangpil Kim
VGen
DiffM
22
15
0
13 Apr 2023
CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model
Dingkang Liang
Jiahao Xie
Zhikang Zou
Xiaoqing Ye
Wei Xu
Xiang Bai
SSL
CLIP
VLM
21
51
0
09 Apr 2023
Accommodating Audio Modality in CLIP for Multimodal Processing
Ludan Ruan
Anwen Hu
Yuqing Song
Liang Zhang
S. Zheng
Qin Jin
VLM
16
10
0
12 Mar 2023
LidarCLIP or: How I Learned to Talk to Point Clouds
Georg Hess
Adam Tonderski
Christoffer Petersson
Kalle Åström
Lennart Svensson
DiffM
16
22
0
13 Dec 2022
TimbreCLIP: Connecting Timbre to Text and Images
Nicolas Jonason
Bob L. T. Sturm
CLIP
19
4
0
21 Nov 2022
Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation
Yusong Wu
K. Chen
Tianyu Zhang
Yuchen Hui
Marianna Nezhurina
Taylor Berg-Kirkpatrick
Shlomo Dubnov
CLIP
17
475
0
12 Nov 2022
Language-Based Audio Retrieval with Converging Tied Layers and Contrastive Loss
Andrew Koh
Chng Eng Siong
16
1
0
29 Jun 2022
BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations
Daisuke Niizumi
Daiki Takeuchi
Yasunori Ohishi
N. Harada
K. Kashino
SSL
19
53
0
15 Apr 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
79
208
0
18 Feb 2022
Sound-Guided Semantic Image Manipulation
Seung Hyun Lee
Wonseok Roh
Wonmin Byeon
Sang Ho Yoon
Chanyoung Kim
Jinkyu Kim
Sangpil Kim
DiffM
10
43
0
30 Nov 2021
Wav2CLIP: Learning Robust Audio Representations From CLIP
Ho-Hsiang Wu
Prem Seetharaman
Kundan Kumar
J. P. Bello
CLIP
VLM
11
267
0
21 Oct 2021
Multimodal Self-Supervised Learning of General Audio Representations
Luyu Wang
Pauline Luc
Adrià Recasens
Jean-Baptiste Alayrac
Aaron van den Oord
SSL
70
41
0
26 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021