2008.05789
Cited By
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
13 August 2020
Ying Cheng
Ruize Wang
Zhihao Pan
Rui Feng
Yuejie Zhang
SSL
Papers citing
"Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning"
50 / 59 papers shown
Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
34
0
0
02 May 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
76
0
0
20 Feb 2025
Language-based Audio Retrieval with Co-Attention Networks
Haoran Sun
Z. Wang
Qiuyi Chen
Jianjun Chen
Jia Wang
Haiyang Zhang
34
0
0
31 Dec 2024
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
Pengcheng Zhao
Jinxing Zhou
Yang Zhao
D. Guo
Yanxiang Chen
88
2
0
15 Dec 2024
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luís Vilaça
Yi Yu
Paula Viana
75
0
0
24 Nov 2024
Interpretable Convolutional SyncNet
Sungjoon Park
Jaesub Yun
Donggeon Lee
Minsik Park
52
0
0
02 Sep 2024
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
Mahrukh Awan
Asmar Nadeem
Muhammad Junaid Awan
Armin Mustafa
Syed Sameed Husain
21
1
0
26 Aug 2024
PSM: Learning Probabilistic Embeddings for Multi-scale Zero-Shot Soundscape Mapping
Subash Khanal
Eric Xing
S. Sastry
A. Dhakal
Zhexiao Xiong
Adeel Ahmad
Nathan Jacobs
36
2
0
13 Aug 2024
Progressive Confident Masking Attention Network for Audio-Visual Segmentation
Yuxuan Wang
Feng Dong
Jinchao Zhu
Shuyue Zhu
VOS
43
0
0
04 Jun 2024
Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models
David Kurzendörfer
Otniel-Bogdan Mercea
A. Sophia Koepke
Zeynep Akata
VLM
CLIP
26
2
0
09 Apr 2024
A Review of Modern Recommender Systems Using Generative Models (Gen-RecSys)
Yashar Deldjoo
Zhankui He
Julian McAuley
A. Korikov
Scott Sanner
Arnau Ramisa
René Vidal
M. Sathiamoorthy
Atoosa Kasirzadeh
Silvia Milano
VLM
28
40
0
31 Mar 2024
Unsupervised Audio-Visual Segmentation with Modality Alignment
Swapnil Bhosale
Haosen Yang
Diptesh Kanojia
Jiangkang Deng
Xiatian Zhu
VOS
37
5
0
21 Mar 2024
$\text{R}^2$-Bench: Benchmarking the Robustness of Referring Perception Models under Perturbations
Xiang Li
Kai Qiu
Jinglu Wang
Xiaohao Xu
Rita Singh
Kashu Yamazaki
Hao Chen
Xiaonan Huang
Bhiksha Raj
VOS
40
1
0
07 Mar 2024
Efficient Multiscale Multimodal Bottleneck Transformer for Audio-Video Classification
Wentao Zhu
25
5
0
08 Jan 2024
LAVSS: Location-Guided Audio-Visual Spatial Audio Separation
Yuxin Ye
Wenming Yang
Yapeng Tian
26
10
0
31 Oct 2023
SimMMDG: A Simple and Effective Framework for Multi-modal Domain Generalization
Hao Dong
Ismail Nejjar
Han Sun
Eleni Chatzi
Olga Fink
24
18
0
30 Oct 2023
Cross-modal Cognitive Consensus guided Audio-Visual Segmentation
Zhaofeng Shi
Qingbo Wu
Fanman Meng
Linfeng Xu
Hongliang Li
VOS
30
3
0
10 Oct 2023
QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition
Xiang Li
Jinglu Wang
Xiaohao Xu
Xiulian Peng
Rita Singh
Yan Lu
Bhiksha Raj
VOS
39
10
0
29 Sep 2023
Audiovisual Moments in Time: A Large-Scale Annotated Dataset of Audiovisual Actions
Michael Joannou
P. Rotshtein
U. Noppeney
16
0
0
18 Aug 2023
Improving Audio-Visual Segmentation with Bidirectional Generation
Dawei Hao
Yuxin Mao
Bowen He
Xiaodong Han
Yuchao Dai
Yiran Zhong
VOS
VGen
30
30
0
16 Aug 2023
Multimodal Imbalance-Aware Gradient Modulation for Weakly-supervised Audio-Visual Video Parsing
Jie Fu
Junyu Gao
Changsheng Xu
26
6
0
05 Jul 2023
Looking and Listening: Audio Guided Text Recognition
Wenwen Yu
Mingyu Liu
Biao Yang
Enming Zhang
Deqiang Jiang
Xing Sun
Yuliang Liu
Xiang Bai
DiffM
25
1
0
06 Jun 2023
Flare-Aware Cross-modal Enhancement Network for Multi-spectral Vehicle Re-identification
A. Zheng
Zhiqi Ma
Zi Wang
Chenglong Li
25
2
0
23 May 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
19
43
0
31 Mar 2023
WASD: A Wilder Active Speaker Detection Dataset
Tiago Roxo
Joana Cabral Costa
Pedro R. M. Inácio
Hugo Manuel Proença
14
3
0
09 Mar 2023
Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification
Meng Liu
Kong Aik Lee
Longbiao Wang
Hanyi Zhang
Chang Zeng
J. Dang
23
10
0
22 Feb 2023
Audio-Visual Segmentation with Semantics
Jinxing Zhou
Xuyang Shen
Jianyuan Wang
Jiayi Zhang
Weixuan Sun
...
Stan Birchfield
Dan Guo
Lingpeng Kong
Meng Wang
Yiran Zhong
VOS
40
37
0
30 Jan 2023
iQuery: Instruments as Queries for Audio-Visual Sound Separation
Jiaben Chen
Renrui Zhang
Dongze Lian
Jiaqi Yang
Ziyao Zeng
Jianbo Shi
16
26
0
07 Dec 2022
Contrastive Positive Sample Propagation along the Audio-Visual Event Line
Jinxing Zhou
Dan Guo
Meng Wang
24
54
0
18 Nov 2022
Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization
Yuanyuan Jiang
Jianqin Yin
Yonghao Dang
35
5
0
11 Oct 2022
MM-PCQA: Multi-Modal Learning for No-reference Point Cloud Quality Assessment
Zicheng Zhang
Wei Sun
Xiongkuo Min
Quan-Gen Zhou
Jun He
Qiyuan Wang
Guangtao Zhai
3DPC
18
72
0
01 Sep 2022
AudioScopeV2: Audio-Visual Attention Architectures for Calibrated Open-Domain On-Screen Sound Separation
Efthymios Tzinis
Scott Wisdom
Tal Remez
J. Hershey
31
29
0
20 Jul 2022
Semantic Novelty Detection via Relational Reasoning
Francesco Cappio Borlino
S. Bucci
Tatiana Tommasi
17
4
0
18 Jul 2022
Online Video Instance Segmentation via Robust Context Fusion
Xiang Li
Jinglu Wang
Xiaohao Xu
Bhiksha Raj
Yan Lu
35
5
0
12 Jul 2022
Modality-Aware Contrastive Instance Learning with Self-Distillation for Weakly-Supervised Audio-Visual Violence Detection
Jiashuo Yu
Jin-Yuan Liu
Ying Cheng
Rui Feng
Yuejie Zhang
14
34
0
12 Jul 2022
Audio-Visual Segmentation
Jinxing Zhou
Jianyuan Wang
J. Zhang
Weixuan Sun
Jing Zhang
Stan Birchfield
Dan Guo
Lingpeng Kong
Meng Wang
Yiran Zhong
VOS
28
110
0
11 Jul 2022
Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization
Jiashuo Yu
Junfu Pu
Ying Cheng
Rui Feng
Ying Shan
14
5
0
07 Jul 2022
A Comprehensive Survey on Video Saliency Detection with Auditory Information: the Audio-visual Consistency Perceptual is the Key!
Chenglizhao Chen
Mengke Song
Wenfeng Song
Li Guo
Muwei Jian
33
25
0
20 Jun 2022
Past and Future Motion Guided Network for Audio Visual Event Localization
Ting-Yen Chen
Jianqin Yin
Jin Tang
21
2
0
08 May 2022
Self-supervised Contrastive Learning for Audio-Visual Action Recognition
Yang Liu
Y. Tan
Haoyu Lan
SSL
38
5
0
28 Apr 2022
Investigating Modality Bias in Audio Visual Video Parsing
Piyush Singh Pasi
Shubham Nemani
P. Jyothi
Ganesh Ramakrishnan
9
4
0
31 Mar 2022
Balanced Multimodal Learning via On-the-fly Gradient Modulation
Xiaokang Peng
Yake Wei
Andong Deng
Dong Wang
Di Hu
19
193
0
29 Mar 2022
Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language
Otniel-Bogdan Mercea
Lukas Riesch
A. Sophia Koepke
Zeynep Akata
19
48
0
07 Mar 2022
Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
Luís Vilaça
Yi Yu
Paula Viana
16
5
0
28 Feb 2022
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
Zitian Zhang
Jie M. Zhang
Jian-Shu Zhang
Ming Wu
Xin Fang
Lirong Dai
SSL
33
10
0
15 Feb 2022
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
25
129
0
08 Dec 2021
Audio-Visual Synchronisation in the wild
Honglie Chen
Weidi Xie
Triantafyllos Afouras
Arsha Nagrani
Andrea Vedaldi
Andrew Zisserman
18
37
0
08 Dec 2021
MM-Pyramid: Multimodal Pyramid Attentional Network for Audio-Visual Event Localization and Video Parsing
Jiashuo Yu
Ying Cheng
Ruiwei Zhao
Rui Feng
Yuejie Zhang
24
53
0
24 Nov 2021
Domain Generalization through Audio-Visual Relative Norm Alignment in First Person Action Recognition
M. Planamente
Chiara Plizzari
Emanuele Alberti
Barbara Caputo
EgoV
14
33
0
19 Oct 2021
Multi-Modal Multi-Instance Learning for Retinal Disease Recognition
Xirong Li
Yang Zhou
Jie Wang
Hailan Lin
Jianchun Zhao
Dayong Ding
Weihong Yu
You-xin Chen
17
36
0
25 Sep 2021