1911.12667
Self-Supervised Learning by Cross-Modal Audio-Video Clustering
Neural Information Processing Systems (NeurIPS), 2019
28 November 2019
Humam Alwassel
D. Mahajan
Bruno Korbar
Lorenzo Torresani
Guohao Li
Du Tran
SSL
Papers citing "Self-Supervised Learning by Cross-Modal Audio-Video Clustering"
50 / 280 papers shown
Title
In-Context Clustering with Large Language Models
Ying Wang
Mengye Ren
Andrew Gordon Wilson
88
0
0
09 Oct 2025
Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic Event Classification
Yuanjian Chen
Yang Xiao
Jinjie Huang
68
0
0
18 Sep 2025
Cross-Modal Retrieval with Cauchy-Schwarz Divergence
Jiahao Zhang
Wenzhe Yin
Shujian Yu
48
0
0
15 Sep 2025
VGGSounder: Audio-Visual Evaluations for Foundation Models
Daniil Zverev
Thaddäus Wiedemer
Christian Schroeder de Witt
Matthias Bethge
Wieland Brendel
A. Sophia Koepke
AuLLM
159
3
0
11 Aug 2025
Action Dubber: Timing Audible Actions via Inflectional Flow
Wenlong Wan
Weiying Zheng
Tianyi Xiang
Guiqing Li
Shengfeng He
129
0
0
16 Jun 2025
RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning
Yuan Luo
Rudolf Hoffmann
Yan Xia
Olaf Wysocki
Benedikt Schwab
T. H. Kolbe
Zorah Lähner
316
2
0
16 Apr 2025
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning
Akash Kumar
Ashlesha Kumar
Vibhav Vineet
Yogesh S Rawat
SSL
835
3
0
08 Apr 2025
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Piyush Bagad
Hazel Doughty
Bernard Ghanem
Cees G. M. Snoek
ViT
SSL
271
0
0
08 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision
Information Fusion (Inf. Fusion), 2025
Xiaofeng Han
Shunpeng Chen
Zenghuang Fu
Zhe Feng
Lue Fan
...
Li Guo
Weiliang Meng
Xiaopeng Zhang
Rongtao Xu
Shibiao Xu
327
32
0
03 Apr 2025
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Fida Mohammad Thoker
Letian Jiang
Chen Zhao
Bernard Ghanem
290
3
0
01 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
424
0
0
30 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
305
5
0
20 Feb 2025
Scaling 4D Representations
João Carreira
Dilara Gokay
Michael King
Chuhan Zhang
Ignacio Rocco
...
Viorica Patraucean
Dima Damen
Pauline Luc
Mehdi S. M. Sajjadi
Andrew Zisserman
379
15
0
19 Dec 2024
Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning
ACM Multimedia Asia (MMAsia), 2024
Meng Shen
Yake Wei
Jianxiong Yin
D. Rajan
D. Hu
Simon See
355
1
0
12 Dec 2024
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment
Kim Sung-Bin
Arda Senocak
Hyunwoo Ha
Tae-Hyun Oh
DiffM
345
2
0
09 Dec 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
ACM Computing Surveys (ACM CSUR), 2024
Luis Vilaca
Yi Yu
Paula Vinan
418
1
0
24 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
384
4
0
18 Nov 2024
What to align in multimodal contrastive learning?
International Conference on Learning Representations (ICLR), 2024
Benoit Dufumier
J. Castillo-Navarro
D. Tuia
Jean-Philippe Thiran
285
24
0
11 Sep 2024
Enhancing Sound Source Localization via False Negative Elimination
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Zengjie Song
Jiangshe Zhang
Yuxi Wang
Junsong Fan
Zhaoxiang Zhang
237
3
0
29 Aug 2024
Cross-Modality Clustering-based Self-Labeling for Multimodal Data Classification
P. Zyblewski
Leandro L. Minku
159
1
0
05 Aug 2024
Advanced Graph Clustering Methods: A Comprehensive and In-Depth Analysis
Timothé Watteau
Aubin Bonnefoy
Simon Illouz-Laurent
Joaquim Jusseau
Serge Iovleff
134
1
0
12 Jul 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model
Elaheh Baharlouei
Mahsa Shafaei
Yigeng Zhang
Hugo Jair Escalante
Thamar Solorio
158
1
0
12 Jun 2024
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
Computer Vision and Pattern Recognition (CVPR), 2024
Mark Hamilton
Andrew Zisserman
John R. Hershey
William T. Freeman
VLM
265
17
0
09 Jun 2024
Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances
Hanlei Zhang
Hua Xu
Fei Long
Xin Wang
Kai Gao
248
5
0
21 May 2024
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding
Visual Communications and Image Processing (VCIP), 2024
Jiajie Teng
Huiyu Duan
Yucheng Zhu
Sijing Wu
Guangtao Zhai
118
3
0
15 May 2024
Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning
F. Sarfraz
Bahram Zonooz
Elahe Arani
CLL
164
4
0
04 May 2024
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
178
7
0
26 Apr 2024
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition
Marah Halawa
Florian Blume
Pia Bideau
Martin Maier
Rasha Abdel Rahman
Olaf Hellwich
CVBM
152
4
0
16 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM
CLIP
134
3
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV
SSL
224
10
0
08 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
187
10
0
28 Mar 2024
Enhancing Video Transformers for Action Understanding with VLM-aided Training
Hui Lu
Hu Jian
Ronald Poppe
A. A. Salah
166
5
0
24 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale
Haosen Yang
Helen Treharne
Jiangkang Deng
Xiatian Zhu
VOS
156
8
0
21 Mar 2024
A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels
Haocheng Han
Minnan Luo
Huan Liu
Fang Nan
221
0
0
20 Mar 2024
HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation
International Journal of Computer Vision (IJCV), 2024
Sha Zhang
Jiajun Deng
Mengwei He
Houqiang Li
Wanli Ouyang
Yanyong Zhang
3DPC
167
12
0
18 Mar 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
International Conference on Machine Learning (ICML), 2024
Jongsuk Kim
Hyeongkeun Lee
Kyeongha Rho
Junmo Kim
Joon Son Chung
177
11
0
14 Mar 2024
Self Supervised Correlation-based Permutations for Multi-View Clustering
Ran Eisenberg
Jonathan Svirsky
Ofir Lindenbaum
307
2
0
26 Feb 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Neural Information Processing Systems (NeurIPS), 2024
Antonín Vobecký
Oriane Siméoni
David Hurych
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
199
49
0
17 Jan 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Jie Zhang
Zhifan Wan
Lanqing Hu
Stephen Lin
Shuzhe Wu
Shiguang Shan
TTA
319
2
0
15 Jan 2024
Motion Guided Token Compression for Efficient Masked Video Modeling
Yukun Feng
Yangming Shi
Fengze Liu
Tan Yan
216
0
0
10 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
215
7
0
08 Jan 2024
Bootstrapping Autonomous Driving Radars with Self-Supervised Learning
Yiduo Hao
Sohrab Madani
Junfeng Guan
Mohammed Alloulah
Saurabh Gupta
Haitham Hassanieh
SSL
253
10
0
07 Dec 2023
Modality Mixer Exploiting Complementary Information for Multi-modal Action Recognition
Sumin Lee
Sangmin Woo
Muhammad Adi Nugroho
Changick Kim
216
0
0
21 Nov 2023
Multimodal Representation Learning by Alternating Unimodal Adaptation
Xiaohui Zhang
Jaehong Yoon
Mohit Bansal
Huaxiu Yao
214
65
0
17 Nov 2023
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
Neural Information Processing Systems (NeurIPS), 2023
Xudong Xu
Dejan Marković
Jacob Sandakly
Todd Keebler
Steven Krenn
Alexander Richard
113
8
0
01 Nov 2023
Towards Generalized Multi-stage Clustering: Multi-view Self-distillation
Jiatai Wang
Zhiwei Xu
Xin Wang
Tao Li
188
1
0
29 Oct 2023
Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining
Zhexiong Liu
Mohamed Elarby
Yang Zhong
Diane Litman
98
14
0
15 Oct 2023
SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets
Daria Reshetova
Swetava Ganguli
C. V. K. Iyer
Vipul Pandey
151
4
0
26 Sep 2023
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Jiangliu Wang
Jianbo Jiao
Yibing Song
Stephen James
Zhan Tong
Chongjian Ge
Pieter Abbeel
Yunhui Liu
86
0
0
25 Sep 2023
TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification
ACM Multimedia (ACM MM), 2023
Meng Liu
K. Liang
Dayu Hu
Hao Yu
Yue Liu
Lingyuan Meng
Wenxuan Tu
Sihang Zhou
Xinwang Liu
203
37
0
21 Sep 2023