VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019

Liunian Harold Li

arXiv:1908.03557

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,260 papers shown

ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text

Megan Van Overborg

28

0

0

02 Dec 2025

Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue

Chiung-Yi Tseng

89

0

0

21 Nov 2025

Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding

81

0

0

12 Nov 2025

SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention

Shreyas C. Dhake

Danail Stoyanov

Matthew J. Clarkson

Mobarak I. Hoque

69

0

0

05 Nov 2025

Generating Accurate and Detailed Captions for High-Resolution Images

220

0

0

31 Oct 2025

FOCUS: Efficient Keyframe Selection for Long Video Understanding

159

0

0

31 Oct 2025

Masked Diffusion Captioning for Visual Feature Learning

257

0

0

30 Oct 2025

MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

156

0

0

27 Oct 2025

HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models

Yavuz Faruk Bakman

Anil Ramakrishna

Mahdi Soltanolkotabi

Salman Avestimehr

149

3

0

25 Oct 2025

Top-Down Semantic Refinement for Image Captioning

302

13

0

25 Oct 2025

Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges

...

Xiangxiang Wang

Tsengdar J. Lee

116

0

0

22 Oct 2025

ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion

215

0

0

19 Oct 2025

Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?

154

0

0

19 Oct 2025

On the Provable Importance of Gradients for Language-Assisted Image Clustering

146

0

0

18 Oct 2025

A Multimodal Approach to Heritage Preservation in the Context of Climate Change

81

0

0

15 Oct 2025

CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization

114

0

0

13 Oct 2025

Towards Self-Refinement of Vision-Language Models with Triangular Consistency

177

2

0

12 Oct 2025

Cooperative Pseudo Labeling for Unsupervised Federated Classification

160

0

0

11 Oct 2025

Unpacking Hateful Memes: Presupposed Context and False Claims

107

0

0

11 Oct 2025

Learning from Mistakes: Enhancing Harmful Meme Detection via Misjudgment Risk Patterns

160

0

0

10 Oct 2025

Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding

226

0

0

10 Oct 2025

Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection

I. M. De la Jara

C. Rodriguez-Opazo

366

0

0

07 Oct 2025

Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning

Chashi Mahiul Islam

Samuel Jacob Chacko

136

0

0

03 Oct 2025

Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

Mykola Pechenizkiy

242

0

0

01 Oct 2025

Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey

202

5

0

29 Sep 2025

Multilingual Vision-Language Models, A Survey

Andrei-Alexandru Manea

Jindřich Libovický

147

1

0

26 Sep 2025

Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction

145

0

0

25 Sep 2025

Audio-Visual Separation with Hierarchical Fusion and Representation Alignment

Hyung Jin Chang

145

0

0

24 Sep 2025

Pure Vision Language Action (VLA) Models: A Comprehensive Survey

308

15

0

23 Sep 2025

Leveraging NTPs for Efficient Hallucination Detection in VLMs

159

0

0

20 Sep 2025

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

175

0

0

18 Sep 2025

MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook

...

127

1

0

17 Sep 2025

DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving

176

0

0

09 Sep 2025

Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models

115

0

0

02 Sep 2025

JVLGS: Joint Vision-Language Gas Leak Segmentation

92

0

0

27 Aug 2025

NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks

Swapnanil Mukherjee

Deepanway Ghosal

95

0

0

27 Aug 2025

A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension

Information Fusion (Inf. Fusion), 2025

Mohammad Zia Ur Rehman

Devraj Raghuvanshi

108

5

0

22 Aug 2025

Checkmate: interpretable and explainable RSVQA is the endgame

Lucrezia Tosato

Christel Chappuis

Syrielle Montariol

151

0

0

18 Aug 2025

BERT-VQA: Visual Question Answering on Plots

84

1

0

14 Aug 2025

Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering

200

0

0

12 Aug 2025

Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges

141

1

0

09 Aug 2025

A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection

Information Processing & Management (IPM), 2025

Mohammad Zia Ur Rehman

Musharaf Maqbool

120

20

0

07 Aug 2025

DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition

141

1

0

07 Aug 2025

SaviorRec: Semantic-Behavior Alignment for Cold-Start Recommendation

188

0

0

02 Aug 2025

Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation

Mohamed Ilyas Lakhal

Ozge Mercanoglu Sincan

202

1

0

31 Jul 2025

Closing the Modality Gap for Mixed Modality Search

Serena Yeung-Levy

133

4

0

25 Jul 2025

A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task

Mashiro Toyooka

Kiyoharu Aizawa

131

0

0

23 Jul 2025

What if Othello-Playing Language Models Could See?

Serge J. Belongie

Maarten de Rijke

Anders Søgaard

158

0

0

19 Jul 2025

Describe Anything Model for Visual Question Answering on Text-rich Images

Dinh-Thang Duong

Truong-Binh Duong

Anh-Khoi Nguyen

Thanh-Huy Nguyen

...

283

2

0

16 Jul 2025

Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation

232

0

0

07 Jul 2025
