2003.03186
Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning
AAAI Conference on Artificial Intelligence (AAAI), 2020
6 March 2020
Elad Amrani
Rami Ben-Ari
Daniel Rotman
A. Bronstein
Papers citing
"Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning"
50 / 77 papers shown
Title
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
Information Fusion (Inf. Fusion), 2025
Xiaolun Jing
Genke Yang
Jian Chu
90
2
0
07 Apr 2025
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
Santhosh Malarvannan
233
3
0
20 Feb 2025
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Hanane Azzag
M. Lebbah
ObjD
207
2
0
17 Sep 2024
Stock Movement Prediction with Multimodal Stable Fusion via Gated Cross-Attention Mechanism
Chang Zong
Jian Shao
Weiming Lu
Yueting Zhuang
143
7
0
06 Jun 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
140
2
0
12 May 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
135
2
0
01 Apr 2024
REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy Correspondence
IEEE Transactions on Multimedia (IEEE TMM), 2024
Ruochen Zheng
Jiahao Hong
Changxin Gao
Nong Sang
114
3
0
13 Mar 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Yue Yu
Xiangyuan Ren
Yuan Cheng
Wei Chu
138
6
0
31 Jan 2024
ViLA: Efficient Video-Language Alignment for Video Question Answering
European Conference on Computer Vision (ECCV), 2023
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
209
19
0
13 Dec 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
216
1
0
28 Nov 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIP
VLM
VGen
166
3
0
30 Oct 2023
Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation Extraction
ACM Multimedia (ACM MM), 2023
Xuming Hu
Junzhe Chen
Aiwei Liu
Shiao Meng
Lijie Wen
Philip S. Yu
137
25
0
25 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
European Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
171
29
0
07 Oct 2023
Preserving Modality Structure Improves Multi-Modal Learning
IEEE International Conference on Computer Vision (ICCV), 2023
Swetha Sirnam
Mamshad Nayeem Rizve
Nina Shvetsova
Hilde Kuehne
M. Shah
136
10
0
24 Aug 2023
Provable Dynamic Fusion for Low-Quality Multimodal Data
International Conference on Machine Learning (ICML), 2023
Qingyang Zhang
Haitao Wu
Changqing Zhang
Qinghua Hu
Huazhu Fu
Qiufeng Wang
Xi Peng
249
97
0
03 Jun 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding
International Conference on Learning Representations (ICLR), 2023
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
125
0
0
28 Mar 2023
Video Question Answering Using CLIP-Guided Visual-Text Attention
IEEE International Conference on Image Processing (ICIP), 2023
Shuhong Ye
Weikai Kong
Chenglin Yao
Jianfeng Ren
Xudong Jiang
126
12
0
06 Mar 2023
Contrastive Video Question Answering via Video Graph Transformer
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Junbin Xiao
Pan Zhou
Angela Yao
Yicong Li
Richang Hong
Shuicheng Yan
Tat-Seng Chua
ViT
166
50
0
27 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
International Journal of Multimedia Information Retrieval (IJMIR), 2023
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
142
25
0
24 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
202
9
0
20 Feb 2023
Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
AAAI Conference on Artificial Intelligence (AAAI), 2023
Min Peng
Chongyang Wang
Yu Shi
Xiang-Dong Zhou
ViT
142
12
0
04 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text Retrieval
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yizhen Chen
Jie Wang
Lijian Lin
Chen Ma
Jin Ma
Ying Shan
VLM
157
32
0
30 Jan 2023
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
141
17
0
18 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
IEEE International Conference on Computer Vision (ICCV), 2023
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
153
7
0
05 Jan 2023
Integrating Multimodal Data for Joint Generative Modeling of Complex Dynamics
International Conference on Machine Learning (ICML), 2022
Manuela Brenner
Florian Hess
G. Koppe
Daniel Durstewitz
341
14
0
15 Dec 2022
Curriculum Learning Meets Weakly Supervised Modality Correlation Learning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Sijie Mai
Ya Sun
Haifeng Hu
143
4
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2022
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
177
39
0
14 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
116
30
0
07 Dec 2022
Normalized Contrastive Learning for Text-Video Retrieval
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yookoon Park
Mahmoud Azab
Bo Xiong
Seungwhan Moon
Florian Metze
Gourab Kundu
Kirmani Ahmed
134
12
0
30 Nov 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
157
19
0
21 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
92
8
0
13 Oct 2022
Learning Transferable Spatiotemporal Representations from Natural Script Knowledge
Computer Vision and Pattern Recognition (CVPR), 2022
Ziyun Zeng
Yuying Ge
Xihui Liu
Bin Chen
Ping Luo
Shutao Xia
Yixiao Ge
AI4TS
125
9
0
30 Sep 2022
Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval
Neural Information Processing Systems (NeurIPS), 2022
Che-Hsien Lin
Ancong Wu
Junwei Liang
Jun Zhang
Wenhang Ge
Wei Zheng
Chunhua Shen
185
37
0
27 Sep 2022
LGDN: Language-Guided Denoising Network for Video-Language Modeling
Neural Information Processing Systems (NeurIPS), 2022
Haoyu Lu
Mingyu Ding
Nanyi Fei
Yuqi Huo
Zhiwu Lu
VLM
207
18
0
23 Sep 2022
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
278
5
0
24 Aug 2022
LocVTP: Video-Text Pre-training for Temporal Localization
European Conference on Computer Vision (ECCV), 2022
Meng Cao
Tianyu Yang
Junwu Weng
Can Zhang
Jue Wang
Yuexian Zou
129
66
0
21 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval
ACM Multimedia (ACM MM), 2022
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIP
VLM
167
367
0
15 Jul 2022
Video Graph Transformer for Video Question Answering
European Conference on Computer Vision (ECCV), 2022
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
351
94
0
12 Jul 2022
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Jinbin Bai
Chunhui Liu
Feiyue Ni
Haofan Wang
Mengying Hu
Xiaofeng Guo
Lele Cheng
150
14
0
11 Jul 2022
Self-Supervised Learning for Videos: A Survey
ACM Computing Surveys (ACM CSUR), 2022
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
310
160
0
18 Jun 2022
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Neural Information Processing Systems (NeurIPS), 2022
Linxi Fan
Guanzhi Wang
Yunfan Jiang
Ajay Mandlekar
Yuncong Yang
Haoyi Zhu
Andrew Tang
De-An Huang
Yuke Zhu
Anima Anandkumar
LM&Ro
269
464
0
17 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
Neural Information Processing Systems (NeurIPS), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
303
269
0
16 Jun 2022
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Jiayuan He
Daniel V. Smith
Flora D. Salim
AI4TS
139
41
0
06 Jun 2022
Learning to Answer Visual Questions from Web Videos
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
171
38
0
10 May 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
European Conference on Computer Vision (ECCV), 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
117
47
0
26 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval
Computer Vision and Pattern Recognition (CVPR), 2022
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIP
VLM
157
64
0
15 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
IEEE Access, 2022
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
199
28
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
European Conference on Computer Vision (ECCV), 2022
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
230
48
0
06 Apr 2022
Learning Audio-Video Modalities from Image Captions
European Conference on Computer Vision (ECCV), 2022
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
130
94
0
01 Apr 2022
Video-Text Representation Learning via Differentiable Weak Temporal Alignment
Computer Vision and Pattern Recognition (CVPR), 2022
Dohwan Ko
Joonmyung Choi
Juyeon Ko
Shinyeong Noh
Kyoung-Woon On
Eun-Sol Kim
Hyunwoo J. Kim
VGen
AI4TS
110
27
0
31 Mar 2022