Self-Supervised Learning by Cross-Modal Audio-Video Clustering

Neural Information Processing Systems (NeurIPS), 2019

28 November 2019

Papers citing "Self-Supervised Learning by Cross-Modal Audio-Video Clustering"

50 / 280 papers shown

In-Context Clustering with Large Language Models Ying Wang Mengye Ren Andrew Gordon Wilson 88 0 0 09 Oct 2025
Temporally Heterogeneous Graph Contrastive Learning for Multimodal Acoustic event Classification Yuanjian Chen Yang Xiao Jinjie Huang 68 0 0 18 Sep 2025
Cross-Modal Retrieval with Cauchy-Schwarz Divergence Jiahao Zhang Wenzhe Yin Shujian Yu 48 0 0 15 Sep 2025
VGGSounder: Audio-Visual Evaluations for Foundation Models Daniil Zverev Thaddäus Wiedemer Christian Schroeder de Witt Matthias Bethge Wieland Brendel A. Sophia Koepke AuLLM 159 3 0 11 Aug 2025
Action Dubber: Timing Audible Actions via Inflectional Flow Wenlong Wan Weiying Zheng Tianyi Xiang Guiqing Li Shengfeng He 129 0 0 16 Jun 2025
RADLER: Radar Object Detection Leveraging Semantic 3D City Models and Self-Supervised Radar-Image Learning Yuan Luo Rudolf Hoffmann Yan Xia Olaf Wysocki Benedikt Schwab T. H. Kolbe Zorah Lähner 316 2 0 16 Apr 2025
A Large-Scale Analysis on Contextual Self-Supervised Video Representation Learning Akash Kumar Ashlesha Kumar Vibhav Vineet Yogesh S Rawat SSL 835 3 0 08 Apr 2025
SEVERE++: Evaluating Benchmark Sensitivity in Generalization of Video Representation Learning Fida Mohammad Thoker Letian Jiang Chen Zhao Piyush Bagad Hazel Doughty Bernard Ghanem Cees G. M. Snoek ViT SSL 271 0 0 08 Apr 2025
Multimodal Fusion and Vision-Language Models: A Survey for Robot Vision Information Fusion (Inf. Fusion), 2025 Xiaofeng Han Shunpeng Chen Zenghuang Fu Zhe Feng Lue Fan ... Li Guo Weiliang Meng Xiaopeng Zhang Rongtao Xu Shibiao Xu 327 32 0 03 Apr 2025
SMILE: Infusing Spatial and Motion Semantics in Masked Video Learning Computer Vision and Pattern Recognition (CVPR), 2025 Fida Mohammad Thoker Letian Jiang Chen Zhao Bernard Ghanem 290 3 0 01 Apr 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition Jongseo Lee Joohyun Chang Dongho Lee Jinwoo Choi 424 0 0 30 Mar 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention Joe Dhanith Shravan Venkatraman Modigari Narendra Vigya Sharma Santhosh Malarvannan 305 5 0 20 Feb 2025
Scaling 4D Representations João Carreira Dilara Gokay Michael King Chuhan Zhang Ignacio Rocco ... Viorica Patraucean Dima Damen Pauline Luc Mehdi S. M. Sajjadi Andrew Zisserman 379 15 0 19 Dec 2024
Enhancing Modality Representation and Alignment for Multimodal Cold-start Active Learning ACM Multimedia Asia (MMAsia), 2024 Meng Shen Yake Wei Jianxiong Yin D. Rajan D. Hu Simon See 355 1 0 12 Dec 2024
Sound2Vision: Generating Diverse Visuals from Audio through Cross-Modal Latent Alignment Kim Sung-Bin Arda Senocak Hyunwoo Ha Tae-Hyun Oh DiffM 345 2 0 09 Dec 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning ACM Computing Surveys (ACM CSUR), 2024 Luis Vilaca Yi Yu Paula Vinan 418 1 0 24 Nov 2024
The Sound of Water: Inferring Physical Properties from Pouring Liquids IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2024 Piyush Bagad Makarand Tapaswi Cees G. M. Snoek Andrew Zisserman 384 4 0 18 Nov 2024
What to align in multimodal contrastive learning? International Conference on Learning Representations (ICLR), 2024 Benoit Dufumier J. Castillo-Navarro D. Tuia Jean-Philippe Thiran 285 24 0 11 Sep 2024
Enhancing Sound Source Localization via False Negative Elimination IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024 Zengjie Song Jiangshe Zhang Yuxi Wang Junsong Fan Zhaoxiang Zhang 237 3 0 29 Aug 2024
Cross-Modality Clustering-based Self-Labeling for Multimodal Data Classification P. Zyblewski Leandro L. Minku 159 1 0 05 Aug 2024
Advanced Graph Clustering Methods: A Comprehensive and In-Depth Analysis Timothé Watteau Aubin Bonnefoy Simon Illouz-Laurent Joaquim Jusseau Serge Iovleff 134 1 0 12 Jul 2024
Labeling Comic Mischief Content in Online Videos with a Multimodal Hierarchical-Cross-Attention Model Elaheh Baharlouei Mahsa Shafaei Yigeng Zhang Hugo Jair Escalante Thamar Solorio 158 1 0 12 Jun 2024
Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language Computer Vision and Pattern Recognition (CVPR), 2024 Mark Hamilton Andrew Zisserman John R. Hershey William T. Freeman VLM 265 17 0 09 Jun 2024
Unsupervised Multimodal Clustering for Semantics Discovery in Multimodal Utterances Hanlei Zhang Hua Xu Fei Long Xin Wang Kai Gao 248 5 0 21 May 2024
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding Visual Communications and Image Processing (VCIP), 2024 Jiajie Teng Huiyu Duan Yucheng Zhu Sijing Wu Guangtao Zhai 118 3 0 15 May 2024
Beyond Unimodal Learning: The Importance of Integrating Multiple Modalities for Lifelong Learning F. Sarfraz Bahram Zonooz Elahe Arani CLL 164 4 0 04 May 2024
Learning text-to-video retrieval from image captioning Lucas Ventura Cordelia Schmid Gül Varol 3DV 178 7 0 26 Apr 2024
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition Marah Halawa Florian Blume Pia Bideau Martin Maier Rasha Abdel Rahman Olaf Hellwich CVBM 152 4 0 16 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models David Kurzendörfer Otniel-Bogdan Mercea A. Sophia Koepke Zeynep Akata VLM CLIP 134 3 0 09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos Changan Chen Kumar Ashutosh Rohit Girdhar David Harwath Kristen Grauman EgoV SSL 224 10 0 08 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners Yan-Bo Lin Gedas Bertasius 187 10 0 28 Mar 2024
Enhancing Video Transformers for Action Understanding with VLM-aided Training Hui Lu Hu Jian Ronald Poppe A. A. Salah 166 5 0 24 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment Swapnil Bhosale Haosen Yang Helen Treharne Jiangkang Deng Xiatian Zhu VOS 156 8 0 21 Mar 2024
A Unified Optimal Transport Framework for Cross-Modal Retrieval with Noisy Labels Haocheng Han Minnan Luo Huan Liu Fang Nan 221 0 0 20 Mar 2024
HVDistill: Transferring Knowledge from Images to Point Clouds via Unsupervised Hybrid-View Distillation International Journal of Computer Vision (IJCV), 2024 Sha Zhang Jiajun Deng Mengwei He Houqiang Li Wanli Ouyang Yanyong Zhang 3DPC 167 12 0 18 Mar 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning International Conference on Machine Learning (ICML), 2024 Jongsuk Kim Hyeongkeun Lee Kyeongha Rho Junmo Kim Joon Son Chung 177 11 0 14 Mar 2024
Self Supervised Correlation-based Permutations for Multi-View Clustering Ran Eisenberg Jonathan Svirsky Ofir Lindenbaum 307 2 0 26 Feb 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images Neural Information Processing Systems (NeurIPS), 2024 Antonín Vobecký Oriane Siméoni David Hurych Spyros Gidaris Andrei Bursuc Patrick Pérez Josef Sivic 199 49 0 17 Jan 2024
Collaboratively Self-supervised Video Representation Learning for Action Recognition IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024 Jie Zhang Zhifan Wan Lanqing Hu Stephen Lin Shuzhe Wu Shiguang Shan TTA 319 2 0 15 Jan 2024
Motion Guided Token Compression for Efficient Masked Video Modeling Yukun Feng Yangming Shi Fengze Liu Tan Yan 216 0 0 10 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification Wentao Zhu 215 7 0 08 Jan 2024
Bootstrapping Autonomous Driving Radars with Self-Supervised Learning Yiduo Hao Sohrab Madani Junfeng Guan Mohammed Alloulah Saurabh Gupta Haitham Hassanieh SSL 253 10 0 07 Dec 2023
Modality Mixer Exploiting Complementary Information for Multi-modal Action Recognition Sumin Lee Sangmin Woo Muhammad Adi Nugroho Changick Kim 216 0 0 21 Nov 2023
Multimodal Representation Learning by Alternating Unimodal Adaptation Xiaohui Zhang Jaehong Yoon Mohit Bansal Huaxiu Yao 214 65 0 17 Nov 2023
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio Neural Information Processing Systems (NeurIPS), 2023 Xudong Xu Dejan Marković Jacob Sandakly Todd Keebler Steven Krenn Alexander Richard 113 8 0 01 Nov 2023
Towards Generalized Multi-stage Clustering: Multi-view Self-distillation Jiatai Wang Zhiwei Xu Xin Wang Tao Li 188 1 0 29 Oct 2023
Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining Zhexiong Liu Mohamed Elarby Yang Zhong Diane Litman 98 14 0 15 Oct 2023
SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets Daria Reshetova Swetava Ganguli C. V. K. Iyer Vipul Pandey 151 4 0 26 Sep 2023
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training Jiangliu Wang Jianbo Jiao Yibing Song Stephen James Zhan Tong Chongjian Ge Pieter Abbeel Yunhui Liu 86 0 0 25 Sep 2023
TMac: Temporal Multi-Modal Graph Learning for Acoustic Event Classification ACM Multimedia (ACM MM), 2023 Meng Liu K. Liang Dayu Hu Hao Yu Yue Liu Lingyuan Meng Wenxuan Tu Sihang Zhou Xinwang Liu 203 37 0 21 Sep 2023