ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2003.03186
  4. Cited By
Noise Estimation Using Density Estimation for Self-Supervised Multimodal
  Learning
v1v2v3 (latest)

Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

AAAI Conference on Artificial Intelligence (AAAI), 2020
6 March 2020
Elad Amrani
Rami Ben-Ari
Daniel Rotman
A. Bronstein
ArXiv (abs)PDFHTMLGithub (7★)

Papers citing "Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning"

50 / 78 papers shown
Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
Backdooring Self-Supervised Contrastive Learning by Noisy Alignment
Tuo Chen
Jie Gui
Minjing Dong
Ju Jia
Lanting Fang
Jian Liu
AAML
171
3
0
19 Aug 2025
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video Retrieval
TC-MGC: Text-Conditioned Multi-Grained Contrastive Learning for Text-Video RetrievalInformation Fusion (Inf. Fusion), 2025
Xiaolun Jing
Genke Yang
Jian Chu
258
5
0
07 Apr 2025
OneEncoder: A Lightweight Framework for Progressive Alignment of
  Modalities
OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities
Hanane Azzag
Hanane Azzag
M. Lebbah
ObjD
383
3
0
17 Sep 2024
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
Joe Dhanith
Shravan Venkatraman
Modigari Narendra
Vigya Sharma
504
11
0
26 Jul 2024
Stock Movement Prediction with Multimodal Stable Fusion via Gated
  Cross-Attention Mechanism
Stock Movement Prediction with Multimodal Stable Fusion via Gated Cross-Attention Mechanism
Chang Zong
Jian Shao
Weiming Lu
Yueting Zhuang
300
12
0
06 Jun 2024
Unified Video-Language Pre-training with Synchronized Audio
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
301
2
0
12 May 2024
VideoDistill: Language-aware Vision Distillation for Video Question
  Answering
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
272
5
0
01 Apr 2024
REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for
  Noisy Correspondence
REPAIR: Rank Correlation and Noisy Pair Half-replacing with Memory for Noisy CorrespondenceIEEE transactions on multimedia (IEEE TMM), 2024
Ruochen Zheng
Jiahao Hong
Changxin Gao
Nong Sang
201
3
0
13 Mar 2024
SNP-S3: Shared Network Pre-training and Significant Semantic
  Strengthening for Various Video-Text Tasks
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Yue Yu
Xiangyuan Ren
Yuan Cheng
Wei Chu
258
6
0
31 Jan 2024
ViLA: Efficient Video-Language Alignment for Video Question Answering
ViLA: Efficient Video-Language Alignment for Video Question AnsweringEuropean Conference on Computer Vision (ECCV), 2023
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
381
25
0
13 Dec 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
  Semantic Vector-Quantized Tokenizer
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
448
1
0
28 Nov 2023
Harvest Video Foundation Models via Efficient Post-Pretraining
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li
Kunchang Li
Yinan He
Yi Wang
Yali Wang
Limin Wang
Yu Qiao
Ping Luo
CLIPVLMVGen
403
3
0
30 Oct 2023
Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity
  and Relation Extraction
Prompt Me Up: Unleashing the Power of Alignments for Multimodal Entity and Relation ExtractionACM Multimedia (ACM MM), 2023
Xuming Hu
Junzhe Chen
Aiwei Liu
Shiao Meng
Lijie Wen
Philip S. Yu
267
31
0
25 Oct 2023
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
HowToCaption: Prompting LLMs to Transform Video Annotations at ScaleEuropean Conference on Computer Vision (ECCV), 2023
Nina Shvetsova
Anna Kukleva
Xudong Hong
Christian Rupprecht
Bernt Schiele
Hilde Kuehne
377
34
0
07 Oct 2023
Preserving Modality Structure Improves Multi-Modal Learning
Preserving Modality Structure Improves Multi-Modal LearningIEEE International Conference on Computer Vision (ICCV), 2023
Swetha Sirnam
Mamshad Nayeem Rizve
Nina Shvetsova
Hilde Kuehne
M. Shah
287
15
0
24 Aug 2023
Provable Dynamic Fusion for Low-Quality Multimodal Data
Provable Dynamic Fusion for Low-Quality Multimodal DataInternational Conference on Machine Learning (ICML), 2023
Qingyang Zhang
Haitao Wu
Changqing Zhang
Qinghua Hu
Huazhu Fu
Qiufeng Wang
Xi Peng
368
133
0
03 Jun 2023
Structured Video-Language Modeling with Temporal Grouping and Spatial
  Grounding
Structured Video-Language Modeling with Temporal Grouping and Spatial GroundingInternational Conference on Learning Representations (ICLR), 2023
Yuanhao Xiong
Long Zhao
Boqing Gong
Ming-Hsuan Yang
Florian Schroff
Ting Liu
Cho-Jui Hsieh
Liangzhe Yuan
VLM
352
0
0
28 Mar 2023
Video Question Answering Using CLIP-Guided Visual-Text Attention
Video Question Answering Using CLIP-Guided Visual-Text AttentionInternational Conference on Information Photonics (ICIP), 2023
Shuhong Ye
Weikai Kong
Chenglin Yao
Jianfeng Ren
Xudong Jiang
268
12
0
06 Mar 2023
Contrastive Video Question Answering via Video Graph Transformer
Contrastive Video Question Answering via Video Graph TransformerIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Junbin Xiao
Pan Zhou
Angela Yao
Yicong Li
Richang Hong
Shuicheng Yan
Tat-Seng Chua
ViT
326
57
0
27 Feb 2023
Deep Learning for Video-Text Retrieval: a Review
Deep Learning for Video-Text Retrieval: a ReviewInternational Journal of Multimedia Information Retrieval (IJMIR), 2023
Cunjuan Zhu
Qi Jia
Wei Chen
Yanming Guo
Yu Liu
257
35
0
24 Feb 2023
STOA-VLP: Spatial-Temporal Modeling of Object and Action for
  Video-Language Pre-training
STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-trainingAAAI Conference on Artificial Intelligence (AAAI), 2023
Weihong Zhong
Mao Zheng
Duyu Tang
Xuan Luo
Heng Gong
Xiaocheng Feng
Bing Qin
467
9
0
20 Feb 2023
Efficient End-to-End Video Question Answering with Pyramidal Multimodal
  Transformer
Efficient End-to-End Video Question Answering with Pyramidal Multimodal TransformerAAAI Conference on Artificial Intelligence (AAAI), 2023
Min Peng
Chongyang Wang
Yu Shi
Xiang-Dong Zhou
ViT
374
13
0
04 Feb 2023
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text
  Retrieval
Tagging before Alignment: Integrating Multi-Modal Tags for Video-Text RetrievalAAAI Conference on Artificial Intelligence (AAAI), 2023
Yizhen Chen
Jie Wang
Lijian Lin
Chen Ma
Jin Ma
Ying Shan
VLM
300
37
0
30 Jan 2023
Temporal Perceiving Video-Language Pre-training
Temporal Perceiving Video-Language Pre-training
Fan Ma
Xiaojie Jin
Heng Wang
Jingjia Huang
Linchao Zhu
Jiashi Feng
Yi Yang
VLM
238
18
0
18 Jan 2023
Learning Trajectory-Word Alignments for Video-Language Tasks
Learning Trajectory-Word Alignments for Video-Language TasksIEEE International Conference on Computer Vision (ICCV), 2023
Xu Yang
Zhang Li
Haiyang Xu
Hanwang Zhang
Qinghao Ye
Chenliang Li
Ming Yan
Yu Zhang
Fei Huang
Songfang Huang
270
8
0
05 Jan 2023
Integrating Multimodal Data for Joint Generative Modeling of Complex
  Dynamics
Integrating Multimodal Data for Joint Generative Modeling of Complex DynamicsInternational Conference on Machine Learning (ICML), 2022
Manuela Brenner
Florian Hess
G. Koppe
Daniel Durstewitz
571
17
0
15 Dec 2022
Curriculum Learning Meets Weakly Supervised Modality Correlation
  Learning
Curriculum Learning Meets Weakly Supervised Modality Correlation LearningConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Sijie Mai
Ya Sun
Haifeng Hu
227
4
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
NLIP: Noise-robust Language-Image Pre-trainingAAAI Conference on Artificial Intelligence (AAAI), 2022
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
341
42
0
14 Dec 2022
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
SimVTP: Simple Video Text Pre-training with Masked Autoencoders
Yue Ma
Tianyu Yang
Yin Shan
Xiu Li
209
30
0
07 Dec 2022
Normalized Contrastive Learning for Text-Video Retrieval
Normalized Contrastive Learning for Text-Video RetrievalConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Yookoon Park
Mahmoud Azab
Bo Xiong
Seungwhan Moon
Florian Metze
Gourab Kundu
Kirmani Ahmed
206
13
0
30 Nov 2022
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal
  Modeling
LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal ModelingConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Dongsheng Chen
Chaofan Tao
Lu Hou
Lifeng Shang
Xin Jiang
Qun Liu
VLM
301
19
0
21 Oct 2022
RaP: Redundancy-aware Video-language Pre-training for Text-Video
  Retrieval
RaP: Redundancy-aware Video-language Pre-training for Text-Video RetrievalConference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Xing Wu
Chaochen Gao
Zijia Lin
Zhongyuan Wang
Jizhong Han
Songlin Hu
185
10
0
13 Oct 2022
Learning Transferable Spatiotemporal Representations from Natural Script
  Knowledge
Learning Transferable Spatiotemporal Representations from Natural Script KnowledgeComputer Vision and Pattern Recognition (CVPR), 2022
Ziyun Zeng
Yuying Ge
Xihui Liu
Bin Chen
Ping Luo
Shutao Xia
Yixiao Ge
AI4TS
268
9
0
30 Sep 2022
Text-Adaptive Multiple Visual Prototype Matching for Video-Text
  Retrieval
Text-Adaptive Multiple Visual Prototype Matching for Video-Text RetrievalNeural Information Processing Systems (NeurIPS), 2022
Che-Hsien Lin
Ancong Wu
Junwei Liang
Jun Zhang
Wenhang Ge
Wei Zheng
Chunhua Shen
293
42
0
27 Sep 2022
LGDN: Language-Guided Denoising Network for Video-Language Modeling
LGDN: Language-Guided Denoising Network for Video-Language ModelingNeural Information Processing Systems (NeurIPS), 2022
Haoyu Lu
Mingyu Ding
Nanyi Fei
Yuqi Huo
Zhiwu Lu
VLM
402
20
0
23 Sep 2022
MuMUR : Multilingual Multimodal Universal Retrieval
MuMUR : Multilingual Multimodal Universal Retrieval
Avinash Madasu
Estelle Aflalo
Gabriela Ben-Melech Stan
Shachar Rosenman
Shao-Yen Tseng
Gedas Bertasius
Vasudev Lal
531
6
0
24 Aug 2022
LocVTP: Video-Text Pre-training for Temporal Localization
LocVTP: Video-Text Pre-training for Temporal LocalizationEuropean Conference on Computer Vision (ECCV), 2022
Meng Cao
Tianyu Yang
Junwu Weng
Can Zhang
Jue Wang
Yuexian Zou
228
72
0
21 Jul 2022
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text
  Retrieval
X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text RetrievalACM Multimedia (ACM MM), 2022
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Ming Yan
Ji Zhang
Rongrong Ji
CLIPVLM
311
436
0
15 Jul 2022
Video Graph Transformer for Video Question Answering
Video Graph Transformer for Video Question AnsweringEuropean Conference on Computer Vision (ECCV), 2022
Junbin Xiao
Pan Zhou
Tat-Seng Chua
Shuicheng Yan
ViT
535
104
0
12 Jul 2022
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
LaT: Latent Translation with Cycle-Consistency for Video-Text Retrieval
Jinbin Bai
Chunhui Liu
Feiyue Ni
Haofan Wang
Mengying Hu
Xiaofeng Guo
Lele Cheng
241
15
0
11 Jul 2022
Self-Supervised Learning for Videos: A Survey
Self-Supervised Learning for Videos: A SurveyACM Computing Surveys (ACM CSUR), 2022
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
615
178
0
18 Jun 2022
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale
  Knowledge
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale KnowledgeNeural Information Processing Systems (NeurIPS), 2022
Linxi Fan
Guanzhi Wang
Yunfan Jiang
Ajay Mandlekar
Yuncong Yang
Haoyi Zhu
Andrew Tang
De-An Huang
Yuke Zhu
Anima Anandkumar
LM&Ro
775
534
0
17 Jun 2022
Zero-Shot Video Question Answering via Frozen Bidirectional Language
  Models
Zero-Shot Video Question Answering via Frozen Bidirectional Language ModelsNeural Information Processing Systems (NeurIPS), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
567
285
0
16 Jun 2022
Beyond Just Vision: A Review on Self-Supervised Representation Learning
  on Multimodal and Temporal Data
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Jiayuan He
Daniel V. Smith
Flora D. Salim
AI4TS
294
45
0
06 Jun 2022
Learning to Answer Visual Questions from Web Videos
Learning to Answer Visual Questions from Web VideosIEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Antoine Yang
Antoine Miech
Josef Sivic
Ivan Laptev
Cordelia Schmid
ViT
431
42
0
10 May 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for
  Video-text Retrieval
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text RetrievalEuropean Conference on Computer Vision (ECCV), 2022
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
190
48
0
26 Apr 2022
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
  Cross-Modal Retrieval
COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal RetrievalComputer Vision and Pattern Recognition (CVPR), 2022
Haoyu Lu
Nanyi Fei
Yuqi Huo
Yizhao Gao
Zhiwu Lu
Jiaxin Wen
CLIPVLM
284
55
0
15 Apr 2022
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with
  Multi-Level Representations
Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level RepresentationsIEEE Access (IEEE Access), 2022
Jie Jiang
Shaobo Min
Weijie Kong
Dihong Gong
Hongfa Wang
Zhifeng Li
Wei Liu
VLM
443
32
0
07 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
ECLIPSE: Efficient Long-range Video Retrieval using Sight and SoundEuropean Conference on Computer Vision (ECCV), 2022
Yan-Bo Lin
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
441
57
0
06 Apr 2022
Learning Audio-Video Modalities from Image Captions
Learning Audio-Video Modalities from Image CaptionsEuropean Conference on Computer Vision (ECCV), 2022
Arsha Nagrani
Paul Hongsuck Seo
Bryan Seybold
Anja Hauth
Santiago Manén
Chen Sun
Cordelia Schmid
CLIP
245
98
0
01 Apr 2022
12
Next
Page 1 of 2