VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,260 papers shown

FlowTok: Flowing Seamlessly Across Text and Image Tokens

678

13 Mar 2025

Chameleon: Fast-slow Neuro-symbolic Lane Topology Extraction

IEEE International Conference on Robotics and Automation (ICRA), 2025

...

435

10 Mar 2025

Anatomy-Aware Conditional Image-Text Retrieval

283

10 Mar 2025

Exploring Multimodal Perception in Large Language Models Through Perceptual Strength Ratings

IEEE Access (IEEE Access), 2025

342

10 Mar 2025

Enhancing Vietnamese VQA through Curriculum Learning on Raw and Augmented Text Representations

273

05 Mar 2025

Vision-Language Model IP Protection via Prompt-based Learning

Computer Vision and Pattern Recognition (CVPR), 2025

404

04 Mar 2025

FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA

S M Sarwar

528

25 Feb 2025

Vision Language Models in Medicine

Beria Chingnabe Kalpelbe

Angel Gabriel Adaambiik

Wei Peng

VLM LM&MA

420

24 Feb 2025

ESANS: Effective and Semantic-Aware Negative Sampling for Large-Scale Retrieval Systems

The Web Conference (WWW), 2025

322

22 Feb 2025

Multi-Turn Multi-Modal Question Clarification for Enhanced Conversational Understanding

Kimia Ramezan

Alireza Amiri Bavandpour

Yifei Yuan

Clemencia Siro

Mohammad Aliannejadi

201

17 Feb 2025

Learning Generalizable Prompt for CLIP with Class Similarity Knowledge

Sehun Jung

Hyang-won Lee

VLM VPVLM

292

17 Feb 2025

Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions

International Conference on Web and Social Media (ICWSM), 2025

Ming Shan Hee

Roy Ka-wei Lee

VLM

330

16 Feb 2025

Vision-Language Models for Edge Networks: A Comprehensive Survey

IEEE Internet of Things Journal (IEEE IoT J.), 2025

402

11 Feb 2025

Multi-Branch Collaborative Learning Network for Video Quality Assessment in Industrial Video Search

Knowledge Discovery and Data Mining (KDD), 2025

323

09 Feb 2025

A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions

436

09 Feb 2025

Mitigating GenAI-powered Evidence Pollution for Out-of-Context Multimodal Misinformation Detection

308

24 Jan 2025

MASS: Overcoming Language Bias in Image-Text Matching

AAAI Conference on Artificial Intelligence (AAAI), 2025

238

20 Jan 2025

Leveraging Taxonomy and LLMs for Improved Multimodal Hierarchical Classification

International Conference on Computational Linguistics (COLING), 2025

Shijing Chen

Mohamed Reda Bouadjenek

283

12 Jan 2025

Visual Large Language Models for Generalized and Specialized Applications

499

06 Jan 2025

SAFE-MEME: Structured Reasoning Framework for Robust Hate Speech Detection in Memes

Palash Nandi

Shivam Sharma

Tanmoy Chakraborty

277

31 Dec 2024

MATCHED: Multimodal Authorship-Attribution To Combat Human Trafficking in Escort-Advertisement Data

286

18 Dec 2024

Bringing Multimodality to Amazon Visual Search System

Knowledge Discovery and Data Mining (KDD), 2024

...

274

17 Dec 2024

Does VLM Classification Benefit from LLM Description Semantics?

AAAI Conference on Artificial Intelligence (AAAI), 2024

425

16 Dec 2024

Advances in Transformers for Robotic Applications: A Review

Nikunj Sanghai

Nik Bear Brown

AI4CE

412

13 Dec 2024

SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

204

07 Dec 2024

Unified Framework for Open-World Compositional Zero-shot Learning

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

382

05 Dec 2024

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

777

04 Dec 2024

Eyes on the Road: State-of-the-Art Video Question Answering Models Assessment for Traffic Monitoring Tasks

Joseph Raj Vishal

Divesh Basina

Aarya Choudhary

Bharatesh Chakravarthi

438

02 Dec 2024

Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection

Kun Qian

Tianyu Sun

Wenhong Wang

248

01 Dec 2024

AlignMamba: Enhancing Multimodal Mamba with Local and Global Cross-modal Alignment

Computer Vision and Pattern Recognition (CVPR), 2024

324

01 Dec 2024

MIMIC: Multimodal Islamophobic Meme Identification and Classification

Safrin Sanzida Islam

Sahid Hossain Mustakim

Sadia Ahmmed

Md. Faiyaz Abdullah Sayeedi

Swapnil Khandoker

Syed Tasdid Azam Dhrubo

Nahid Md Lokman Hossain

251

01 Dec 2024

Approximate Fiber Product: A Preliminary Algebraic-Geometric Perspective on Multimodal Embedding Alignment

Dongfang Zhao

179

30 Nov 2024

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

382

27 Nov 2024

Enhancing Few-Shot Out-of-Distribution Detection with Gradient Aligned Context Optimization

274

24 Nov 2024

Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios

Neural Information Processing Systems (NeurIPS), 2024

401

20 Nov 2024

A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

298

17 Nov 2024

Prompt-enhanced Network for Hateful Meme Classification

International Joint Conference on Artificial Intelligence (IJCAI), 2024

357

12 Nov 2024

Renaissance: Investigating the Pretraining of Vision-Language Encoders

Clayton Fields

C. Kennington

VLM

201

11 Nov 2024

Harmful YouTube Video Detection: A Taxonomy of Online Harm and MLLMs as Alternative Annotators

Claire Jo

Miki Wesołowska

Magdalena Wojcieszak

333

06 Nov 2024

Multimodal Commonsense Knowledge Distillation for Visual Question Answering

154

05 Nov 2024

Can Multimodal Large Language Model Think Analogically?

361

02 Nov 2024

R-LLaVA: Improving Med-VQA Understanding through Visual Region of Interest

841

27 Oct 2024

MAD-Sherlock: Multi-Agent Debate for Visual Misinformation Detection

Christian Schroeder de Witt

336

26 Oct 2024

A Survey of Multimodal Sarcasm Detection

International Joint Conference on Artificial Intelligence (IJCAI), 2024

293

24 Oct 2024

Deep Insights into Cognitive Decline: A Survey of Leveraging Non-Intrusive Modalities with Deep Learning Techniques

Applied Soft Computing (Appl. Soft Comput.), 2024

David Ortiz-Perez

Manuel Benavent-Lledo

José García Rodríguez

David Tomás

M. Flores Vizcaya-Moreno

283

24 Oct 2024

Exploiting Text-Image Latent Spaces for the Description of Visual Concepts

International Conference on Pattern Recognition (ICPR), 2024

221

23 Oct 2024

Reducing Hallucinations in Vision-Language Models via Latent Space Steering

478

21 Oct 2024

ChitroJera: A Regionally Relevant Visual Question Answering Dataset for Bangla

Deeparghya Dutta Barua

Md Sakib Ul Rahman Sourove

Md Fahim

Fabiha Haider

Fariha Tanjim Shifat

Md Tasmim Rahman Adib

Anam Borhan Uddin

Md Farhan Ishmam

Md Farhad Alam

254

19 Oct 2024

ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering

Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024

Nghia Hieu Nguyen

Tho Thanh Quan

Ngan Luu-Thuy Nguyen

273

18 Oct 2024

VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks

Shailaja Keyur Sampat

Yezhou Yang

MLLM CoGe ReLM VLM LRM

233

17 Oct 2024