1804.03641
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
10 April 2018
Andrew Owens
Alexei A. Efros
SSL
Papers citing "Audio-Visual Scene Analysis with Self-Supervised Multisensory Features"
50 / 491 papers shown
Title
GeminiFusion: Efficient Pixel-wise Multimodal Fusion for Vision Transformer
Ding Jia
Jianyuan Guo
Kai Han
Han Wu
Chao Zhang
Chang Xu
Xinghao Chen
ViT
480
49
0
03 Jun 2024
CLIBD: Bridging Vision and Genomics for Biodiversity Monitoring at Scale
ZeMing Gong
Austin T. Wang
Joakim Bruslund Haurum
Graham W. Taylor
Angel X. Chang
614
16
0
27 May 2024
Images that Sound: Composing Images and Sounds on a Single Canvas
Ziyang Chen
Daniel Geng
Andrew Owens
DiffM
365
15
0
20 May 2024
A Survey of Generative Techniques for Spatial-Temporal Data Mining
Qianru Zhang
Haixin Wang
Cheng Long
Liangcai Su
Xingwei He
...
Tailin Wu
Hongzhi Yin
Siu-Ming Yiu
Qi Tian
Christian S. Jensen
AI4TS
189
14
0
15 May 2024
Look Once to Hear: Target Speech Hearing with Noisy Examples
International Conference on Human Factors in Computing Systems (CHI), 2024
Bandhav Veluri
Malek Itani
Tuochao Chen
Takuya Yoshioka
Shyamnath Gollakota
296
32
0
10 May 2024
Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition
Marah Halawa
Florian Blume
Pia Bideau
Martin Maier
Rasha Abdel Rahman
Olaf Hellwich
CVBM
196
4
0
16 Apr 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM
CLIP
158
3
0
09 Apr 2024
SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos
Changan Chen
Kumar Ashutosh
Rohit Girdhar
David Harwath
Kristen Grauman
EgoV
SSL
232
10
0
08 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
227
10
0
28 Mar 2024
Robust Active Speaker Detection in Noisy Environments
Siva Sai Nagender Vasireddy
Chenxu Zhang
Xiaohu Guo
Yapeng Tian
342
1
0
27 Mar 2024
Audio-Visual Segmentation via Unlabeled Frame Exploitation
Jinxiang Liu
Yikun Liu
Fei Zhang
Chen Ju
Ya Zhang
Yanfeng Wang
279
25
0
17 Mar 2024
EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning
International Conference on Machine Learning (ICML), 2024
Jongsuk Kim
Hyeongkeun Lee
Kyeongha Rho
Junmo Kim
Joon Son Chung
181
11
0
14 Mar 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
142
2
0
23 Feb 2024
BBA: Bi-Modal Behavioral Alignment for Reasoning with Large Vision-Language Models
Xueliang Zhao
Xinting Huang
Tingchen Fu
Qintong Li
Shansan Gong
Lemao Liu
Wei Bi
Lingpeng Kong
LRM
254
4
0
21 Feb 2024
Multimodal Action Quality Assessment
Ling-an Zeng
Wei-Shi Zheng
449
30
0
31 Jan 2024
Synchformer: Efficient Synchronization from Sparse Cues
Vladimir E. Iashin
Weidi Xie
Esa Rahtu
Andrew Zisserman
215
50
0
29 Jan 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Neural Information Processing Systems (NeurIPS), 2024
Antonín Vobecký
Oriane Siméoni
David Hurych
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
215
49
0
17 Jan 2024
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Information Fusion (Inf. Fusion), 2024
Guoying Zhao
Zheng Lian
Yinan Han
Jianhua Tao
232
60
0
11 Jan 2024
FunnyNet-W: Multimodal Learning of Funny Moments in Videos in the Wild
International Journal of Computer Vision (IJCV), 2024
Zhi-Song Liu
Robin Courant
Vicky Kalogeiton
327
9
0
08 Jan 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
243
7
0
08 Jan 2024
Balanced Multi-modal Federated Learning via Cross-Modal Infiltration
Yunfeng Fan
Wenchao Xu
Yining Qi
Jiaqi Zhu
Song Guo
189
4
0
31 Dec 2023
Evaluation of Barlow Twins and VICReg self-supervised learning for sound patterns of bird and anuran species
Fábio Felix Dias
M. Ponti
Mílton Cezar Ribeiro
R. Minghim
SSL
115
0
0
18 Dec 2023
Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling
Computer Vision and Pattern Recognition (CVPR), 2023
Shentong Mo
Pedro Morgado
238
29
0
02 Dec 2023
Centre Stage: Centricity-based Audio-Visual Temporal Action Detection
Hanyuan Wang
Majid Mirmehdi
Dima Damen
Toby Perrett
171
3
0
28 Nov 2023
Rethink Cross-Modal Fusion in Weakly-Supervised Audio-Visual Video Parsing
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Yating Xu
Conghui Hu
Gim Hee Lee
155
7
0
14 Nov 2023
Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation
Knowledge-Based Systems (KBS), 2023
Zhaojian Li
Jiangwei Zhong
Yuan Yuan
198
9
0
13 Nov 2023
Sounding Bodies: Modeling 3D Spatial Sound of Humans Using Body Pose and Audio
Neural Information Processing Systems (NeurIPS), 2023
Xudong Xu
Dejan Marković
Jacob Sandakly
Todd Keebler
Steven Krenn
Alexander Richard
121
8
0
01 Nov 2023
CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Asmar Nadeem
Adrian Hilton
R. Dawes
Graham A. Thomas
A. Mustafa
264
13
0
25 Oct 2023
Separating Invisible Sounds Toward Universal Audiovisual Scene-Aware Sound Separation
Yiyang Su
Ali Vosoughi
Shijian Deng
Yapeng Tian
Chenliang Xu
208
5
0
18 Oct 2023
GRID: A Platform for General Robot Intelligence Development
Sai H. Vemprala
Shuhang Chen
Abhinav Shukla
Dinesh Narayanan
Ashish Kapoor
235
11
0
02 Oct 2023
Emotional Listener Portrait: Neural Listener Head Generation with Emotion
IEEE International Conference on Computer Vision (ICCV), 2023
Luchuan Song
Guojun Yin
Zhenchao Jin
Xiaoyi Dong
Chenliang Xu
381
18
0
29 Sep 2023
RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation
International Conference on Learning Representations (ICLR), 2023
Samuel Pegg
Kai Li
Xiaolin Hu
379
8
0
29 Sep 2023
M³3D: Learning 3D priors using Multi-Modal Masked Autoencoders for 2D image and video understanding
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Muhammad Abdullah Jamal
Omid Mohareri
3DPC
208
2
0
26 Sep 2023
SeMAnD: Self-Supervised Anomaly Detection in Multimodal Geospatial Datasets
Daria Reshetova
Swetava Ganguli
C. V. K. Iyer
Vipul Pandey
175
4
0
26 Sep 2023
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Jiangliu Wang
Jianbo Jiao
Yibing Song
Stephen James
Zhan Tong
Chongjian Ge
Pieter Abbeel
Yunhui Liu
94
0
0
25 Sep 2023
TalkNCE: Improving Active Speaker Detection with Talk-Aware Contrastive Learning
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Chaeyoung Jung
Suyeon Lee
KiHyun Nam
Kyeongha Rho
You Jin Kim
Youngjoon Jang
Joon Son Chung
157
14
0
21 Sep 2023
A Large-scale Dataset for Audio-Language Representation Learning
ACM Multimedia (ACM MM), 2023
Luoyi Sun
Xuenan Xu
Mengyue Wu
Weidi Xie
327
45
0
20 Sep 2023
Sound Source Localization is All about Cross-Modal Alignment
IEEE International Conference on Computer Vision (ICCV), 2023
Arda Senocak
H. Ryu
Junsik Kim
Tae-Hyun Oh
Hanspeter Pfister
Joon Son Chung
179
30
0
19 Sep 2023
The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Shilong Wu
Chenxi Wang
Hang Chen
Yusheng Dai
Chenyue Zhang
...
Sabato Marco Siniscalchi
O. Scharenborg
Zhong-Qiu Wang
Jia Pan
Jianqing Gao
136
12
0
15 Sep 2023
Enhancing multimodal cooperation via sample-level modality valuation
Computer Vision and Pattern Recognition (CVPR), 2023
Yake Wei
Ruoxuan Feng
Zihe Wang
Di Hu
409
48
0
12 Sep 2023
Text-to-feature diffusion for audio-visual few-shot learning
Otniel-Bogdan Mercea
Thomas Hummel
A. Sophia Koepke
Zeynep Akata
VLM
178
3
0
07 Sep 2023
AdVerb: Visually Guided Audio Dereverberation
IEEE International Conference on Computer Vision (ICCV), 2023
Sanjoy Chowdhury
Sreyan Ghosh
Subhrajyoti Dasgupta
Anton Ratnarajah
Utkarsh Tyagi
Tianyi Zhou
190
18
0
23 Aug 2023
Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions
PLoS ONE, 2023
Michael Joannou
P. Rotshtein
U. Noppeney
141
1
0
18 Aug 2023
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models
AAAI Conference on Artificial Intelligence (AAAI), 2023
Heng Wang
Jianbo Ma
Santiago Pascual
Richard Cartwright
Weidong (Tom) Cai
VGen
345
71
0
18 Aug 2023
Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization
ACM Multimedia (ACM MM), 2023
Tianyu Liu
Peng Zhang
Wei Huang
Yufei Zha
Tao You
Yanni Zhang
SSL
115
4
0
09 Aug 2023
Target Speech Extraction with Conditional Diffusion Model
Interspeech, 2023
Naoyuki Kamo
Marc Delcroix
Tomohiro Nakatani
DiffM
161
28
0
08 Aug 2023
DAVIS: High-Quality Audio-Visual Separation with Generative Diffusion Models
Asian Conference on Computer Vision (ACCV), 2023
Chao Huang
Susan Liang
Yapeng Tian
Anurag Kumar
Chenliang Xu
DiffM
162
9
0
31 Jul 2023
FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiji Huang
Sihao Lin
Guiyu Liu
Mukun Luo
Chao Ye
Hang Xu
Xiaojun Chang
Xiaodan Liang
188
14
0
31 Jul 2023
PEANUT: A Human-AI Collaborative Tool for Annotating Audio-Visual Data
ACM Symposium on User Interface Software and Technology (UIST), 2023
Zheng Zhang
Zheng Ning
Chenliang Xu
Yapeng Tian
Toby Jia-Jun Li
215
11
0
27 Jul 2023
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Kun Yuan
V. Srivastav
Tong Yu
Joël L. Lavanchy
J. Marescaux
Pietro Mascagni
Nassir Navab
N. Padoy
651
44
0
27 Jul 2023