VL-BERT: Pre-training of Generic Visual-Linguistic Representations

International Conference on Learning Representations (ICLR), 2019

22 August 2019

Weijie Su

Papers citing "VL-BERT: Pre-training of Generic Visual-Linguistic Representations"

50 / 1,047 papers shown

VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

Silin Cheng

Kai Han

MLLM VPVLM VLM

349

27 Nov 2025

TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models

504

20 Nov 2025

DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation

158

17 Nov 2025

A Retrospect to Multi-prompt Learning across Vision and Language

IEEE International Conference on Computer Vision (ICCV), 2023

474

31 Oct 2025

FOCUS: Efficient Keyframe Selection for Long Video Understanding

239

31 Oct 2025

Masked Diffusion Captioning for Visual Feature Learning

348

30 Oct 2025

Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning

145

24 Oct 2025

Modest-Align: Data-Efficient Alignment for Vision-Language Models

171

24 Oct 2025

ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

314

19 Oct 2025

FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification

125

17 Oct 2025

Vision-Centric Activation and Coordination for Multimodal Large Language Models

424

16 Oct 2025

CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization

186

13 Oct 2025

Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

292

10 Oct 2025

Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation

Pattern Recognition (Pattern Recogn.), 2025

236

10 Oct 2025

Multilingual Vision-Language Models, A Survey

Andrei-Alexandru Manea

Jindřich Libovický

VLM

213

26 Sep 2025

Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering

243

25 Sep 2025

Copycat vs. Original: Multi-modal Pretraining and Variable Importance in Box-office Prediction

Qin Chao

Eunsoo Kim

Boyang Albert Li

185

18 Sep 2025

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

...

163

17 Sep 2025

Data Leakage in Visual Datasets

297

24 Aug 2025

Checkmate: interpretable and explainable RSVQA is the endgame

231

18 Aug 2025

A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering

218

14 Aug 2025

Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges

218

09 Aug 2025

ModalFormer: Multimodal Transformer for Low-Light Image Enhancement

282

27 Jul 2025

VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning

485

20 Jun 2025

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

333

13 Jun 2025

Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Kshitish Ghate

Tessa E. S. Charlesworth

Mona Diab

Aylin Caliskan

VLM

221

06 Jun 2025

OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis

IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2025

Jiewen Hu

Leena Mathur

Paul Pu Liang

Louis-Philippe Morency

CVBM

233

03 Jun 2025

MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

389

02 Jun 2025

TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning

369

19 May 2025

A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision

318

16 May 2025

GeoMM: On Geodesic Perspective for Multi-modal Learning

Computer Vision and Pattern Recognition (CVPR), 2025

Shibin Mei

Hang Wang

Bingbing Ni

353

16 May 2025

A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects

330

27 Apr 2025

Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions

...

615

16 Apr 2025

HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues

Xiwen Li

Ross T. Whitaker

Tolga Tasdizen

424

15 Apr 2025

DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion

313

09 Apr 2025

Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention

International Journal of Computer Vision (IJCV), 2024

500

03 Apr 2025

UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants

PLoS ONE (PLoS ONE), 2025

212

26 Mar 2025

MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

515

24 Mar 2025

Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

Computer Vision and Pattern Recognition (CVPR), 2025

387

21 Mar 2025

DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models

Xirui Zhou

Lianlei Shan

Xiaolin Gui

285

14 Mar 2025

Anatomy-Aware Conditional Image-Text Retrieval

290

10 Mar 2025

MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations

Computer Vision and Pattern Recognition (CVPR), 2025

613

02 Mar 2025

FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA

S M Sarwar

586

25 Feb 2025

Vision-Language Models for Edge Networks: A Comprehensive Survey

IEEE Internet of Things Journal (IEEE IoT J.), 2025

407

11 Feb 2025

Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation

268

29 Jan 2025

Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks

Computer Vision and Pattern Recognition (CVPR), 2025

Subhashree Radhakrishnan

566

14 Jan 2025

Benchmarking Large and Small MLLMs

158

04 Jan 2025

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Neural Information Processing Systems (NeurIPS), 2024

...

1.0K

145

03 Jan 2025

Towards Visual Grounding: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

1.1K

28 Dec 2024

Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations

AAAI Conference on Artificial Intelligence (AAAI), 2024

Carola-Bibiane Schonlieb

Yuyan Chen

Angelica I Aviles-Rivero

AI4TS

405

20 Dec 2024