Masked Vision and Language Modeling for Multi-modal Representation Learning

International Conference on Learning Representations (ICLR), 2022

3 August 2022

Papers citing "Masked Vision and Language Modeling for Multi-modal Representation Learning"

36 / 36 papers shown

Multilingual Vision-Language Models, A Survey

Andrei-Alexandru Manea

Jindřich Libovický

VLM

147

26 Sep 2025

EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition

Hugo Thimonier

Antony Perzo

Renaud Seguier

145

19 Aug 2025

Boosting Visual Knowledge-Intensive Training for LVLMs Through Causality-Driven Visual Object Completion

International Joint Conference on Artificial Intelligence (IJCAI), 2025

126

06 Aug 2025

Distribution-Based Masked Medical Vision-Language Model Using Structured Reports

252

29 Jul 2025

X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP

477

08 May 2025

Investigating the Effect of Parallel Data in the Cross-Lingual Transfer for Vision-Language Encoders

International Conference on Text, Speech and Dialogue (TSD), 2025

Andrei-Alexandru Manea

Jindřich Libovický

VLM

394

30 Apr 2025

REJEPA: A Novel Joint-Embedding Predictive Architecture for Efficient Remote Sensing Image Retrieval

301

04 Apr 2025

SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding

461

04 Apr 2025

DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE J-STARS), 2025

313

25 Mar 2025

CAST: Cross-modal Alignment Similarity Test for Vision Language Models

International Conference on Computational Linguistics (COLING), 2024

238

17 Sep 2024

NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality

Hao Yang

Ashwin Swaminathan

Colin Jon Taylor

194

18 Aug 2024

Masked Image Modeling: A Survey

International Journal of Computer Vision (IJCV), 2024

Vlad Hondru

Florinel-Alin Croitoru

Shervin Minaee

Radu Tudor Ionescu

Andrii Zadaianchuk

483

13 Aug 2024

Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement

346

11 Jul 2024

Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition

206

09 Jul 2024

PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents

...

239

20 Jun 2024

IntCoOp: Interpretability-Aware Vision-Language Prompt Tuning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Samyadeep Basu

188

19 Jun 2024

ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions

...

171

29 May 2024

Enhancing Vision-Language Pre-training with Rich Supervisions

412

05 Mar 2024

Enhancing the vision-language foundation model with key semantic knowledge-emphasized report refinement

Cheng Li

223

21 Jan 2024

Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing

IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2024

348

15 Jan 2024

Mask Grounding for Referring Image Segmentation

Gao Huang

383

19 Dec 2023

Hulk: A Universal Knowledge Translator for Human-Centric Tasks

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

...

718

04 Dec 2023

Point Cloud Self-supervised Learning via 3D to Multi-view Masked Learner

322

17 Nov 2023

FOCAL: Contrastive Learning for Multimodal Time-Series Sensing Signals in Factorized Orthogonal Latent Space

Neural Information Processing Systems (NeurIPS), 2023

216

30 Oct 2023

VeCLIP: Improving CLIP Training via Visual-enriched Captions

European Conference on Computer Vision (ECCV), 2023

...

365

11 Oct 2023

Continual Contrastive Spoken Language Understanding

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Bhiksha Raj

354

04 Oct 2023

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE

AAAI Conference on Artificial Intelligence (AAAI), 2023

197

23 Aug 2023

Foundational Models Defining a New Era in Vision: A Survey and Outlook

Muhammad Awais

Muzammal Naseer

Salman Khan

Rao Muhammad Anwer

Hisham Cholakkal

434

151

25 Jul 2023

Global and Local Semantic Completion Learning for Vision-Language Pre-training

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

Wenzhe Zhao

Hongfa Wang

Yujiu Yang

Wei Liu

VLM

252

12 Jun 2023

Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning

Computer Vision and Pattern Recognition (CVPR), 2023

221

10 Mar 2023

Advancing Radiograph Representation Learning with Masked Record Modeling

International Conference on Learning Representations (ICLR), 2023

287

30 Jan 2023

Aerial Image Object Detection With Vision Transformer Detector (ViTDet)

IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2023

Liya Wang

A. Tien

414

28 Jan 2023

Scaling Language-Image Pre-training via Masking

Computer Vision and Pattern Recognition (CVPR), 2022

Yanghao Li

Haoqi Fan

Ronghang Hu

Christoph Feichtenhofer

Kaiming He

CLIP VLM

375

393

01 Dec 2022

SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-training

IEEE International Conference on Computer Vision (ICCV), 2022

Cihang Xie

308

21 Nov 2022

MAMO: Masked Multimodal Modeling for Fine-Grained Vision-Language Representation Learning

Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2022

Jing Liu

306

09 Oct 2022

Contrastive Audio-Visual Masked Autoencoder

International Conference on Learning Representations (ICLR), 2022

396

167

02 Oct 2022