VisualBERT: A Simple and Performant Baseline for Vision and Language


9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,260 papers shown
ASCIIBench: Evaluating Language-Model-Based Understanding of Visually-Oriented Text
Kerry Luo
Michael Fu
Joshua Peguero
Husnain Malik
Anvay Patil
Joyce Lin
Megan Van Overborg
Ryan Sarmiento
Kevin Zhu
28
0
0
02 Dec 2025
Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue
Lin Yu
Xiaofei Han
Yifei Kang
Chiung-Yi Tseng
Danyang Zhang
Ziqian Bi
Zhimo Han
89
0
0
21 Nov 2025
Spatio-Temporal Data Enhanced Vision-Language Model for Traffic Scene Understanding
JingTian Ma
Jingyuan Wang
Wayne Xin Zhao
Guoping Liu
Xiang Wen
CLIP
81
0
0
12 Nov 2025
SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention
Shreyas C. Dhake
Jiayuan Huang
Runlong He
Danyal Z. Khan
E. Mazomenos
Sophia Bano
Hani J. Marcus
Danail Stoyanov
Matthew J. Clarkson
Mobarak I. Hoque
69
0
0
05 Nov 2025
Generating Accurate and Detailed Captions for High-Resolution Images
Hankyeol Lee
Gawon Seo
Kyounggyu Lee
Dogun Kim
Kyungwoo Song
Jiyoung Jung
MLLM VLM
220
0
0
31 Oct 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
159
0
0
31 Oct 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
257
0
0
30 Oct 2025
MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
Suchan Lee
Jihoon Choi
Sohyeon Lee
Minseok Song
Bong-Gyu Jang
Hwanjo Yu
S. Han
AI4TS
156
0
0
27 Oct 2025
HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models
Erum Mushtaq
Zalan Fabian
Yavuz Faruk Bakman
Anil Ramakrishna
Mahdi Soltanolkotabi
Salman Avestimehr
149
3
0
25 Oct 2025
Top-Down Semantic Refinement for Image Captioning
Jusheng Zhang
Kaitong Cai
Jing Yang
Jian Wang
Chengpei Tang
Keze Wang
DiffM MLLM BDL
302
13
0
25 Oct 2025
Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Cheng Huang
Nyima Tashi
Fan Gao
Yutong Liu
J. Li
...
Guojie Tang
Xiangxiang Wang
Jia Zhang
Tsengdar J. Lee
Yongbin Yu
116
0
0
22 Oct 2025
ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
Wei Huang
Peining Li
Meiyu Liang
Xu Hou
Junping Du
Yingxia Shao
Guanhua Ye
Wu Liu
Kangkang Lu
Yang Yu
VLM
215
0
0
19 Oct 2025
Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?
Zhihui Yang
Yupei Wang
Kaijie Mo
Zhe Zhao
Renfen Hu
154
0
0
19 Oct 2025
On the Provable Importance of Gradients for Language-Assisted Image Clustering
Bo Peng
Jie Lu
G. Zhang
Zhen Fang
VLM
146
0
0
18 Oct 2025
A Multimodal Approach to Heritage Preservation in the Context of Climate Change
David Roqui
Adèle Cormier
Nistor Grozavu
Ann Bourges
81
0
0
15 Oct 2025
CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
Fengling Zhu
Boshi Liu
Jingyu Hua
Sheng Zhong
DiffM AAML
114
0
0
13 Oct 2025
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng
Guangyi Chen
Tianpei Gu
Lingjing Kong
Yan Li
Zeyu Tang
Kun Zhang
177
2
0
12 Oct 2025
Cooperative Pseudo Labeling for Unsupervised Federated Classification
Kuangpu Guo
Lijun Sheng
Yongcan Yu
Jian Liang
Zilei Wang
Ran He
FedML VLM
160
0
0
11 Oct 2025
Unpacking Hateful Memes: Presupposed Context and False Claims
Weibin Cai
Jiayu Li
R. Zafarani
107
0
0
11 Oct 2025
Learning from Mistakes: Enhancing Harmful Meme Detection via Misjudgment Risk Patterns
Wenshuo Wang
Ziyou Jiang
Junjie Wang
Mingyang Li
Jie Huang
Yuekai Huang
Zhiyuan Chang
Feiyan Duan
Qing Wang
160
0
0
10 Oct 2025
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang
Jieyu Zhang
Taoyang Jia
Chenhao Zheng
Ziqi Gao
J. S. Park
Winson Han
Ranjay Krishna
226
0
0
10 Oct 2025
Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection
I. M. De la Jara
C. Rodriguez-Opazo
D. Teney
D. Ranasinghe
E. Abbasnejad
OODD
366
0
0
07 Oct 2025
Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
Chashi Mahiul Islam
Oteo Mamo
Samuel Jacob Chacko
Xiuwen Liu
Weikuan Yu
LRM
136
0
0
03 Oct 2025
Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
Shunfeng Zheng
Yudi Zhang
Meng Fang
Zihan Zhang
Zhitan Wu
Mykola Pechenizkiy
Ling-Hao Chen
ReLM RALM LRM
242
0
0
01 Oct 2025
Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey
Yuntao Shou
Tao Meng
Wei Ai
Keqin Li
LRM
202
5
0
29 Sep 2025
Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea
Jindřich Libovický
VLM
147
1
0
26 Sep 2025
Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction
Lei Hei
Tingjing Liao
Yingxin Pei
Yiyang Qi
Jiaqi Wang
Ruiting Li
Feiliang Ren
145
0
0
25 Sep 2025
Audio-Visual Separation with Hierarchical Fusion and Representation Alignment
Han Hu
Dongheng Lin
Qiming Huang
Yuqi Hou
Hyung Jin Chang
Jianbo Jiao
145
0
0
24 Sep 2025
Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Dapeng Zhang
Jin Sun
Chenghui Hu
Xiaoyan Wu
Zhenlong Yuan
R. Zhou
Fei Shen
Qingguo Zhou
LM&Ro
308
15
0
23 Sep 2025
Leveraging NTPs for Efficient Hallucination Detection in VLMs
Ofir Azachi
Kfir Eliyahu
Eyal El Ani
Rom Himelstein
Roi Reichart
Yuval Pinter
Nitay Calderon
VLM
159
0
0
20 Sep 2025
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Saeed Amizadeh
Sara Abdali
Yinheng Li
K. Koishida
175
0
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
127
1
0
17 Sep 2025
DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving
Sven Kirchner
Nils Purschke
Ross Greer
Alois C. Knoll
3DV VLM
176
0
0
09 Sep 2025
Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
Hiroshi Sasaki
VLM
115
0
0
02 Sep 2025
JVLGS: Joint Vision-Language Gas Leak Segmentation
Xinlong Zhao
Qixiang Pang
Shan Du
92
0
0
27 Aug 2025
NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Aritra Dutta
Swapnanil Mukherjee
Deepanway Ghosal
Somak Aditya
VLM
95
0
0
27 Aug 2025
A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension
Information Fusion (Inf. Fusion), 2025
Mohammad Zia Ur Rehman
Devraj Raghuvanshi
Umang Jain
Shubhi Bansal
Nagendra Kumar
108
5
0
22 Aug 2025
Checkmate: interpretable and explainable RSVQA is the endgame
Lucrezia Tosato
Christel Chappuis
Syrielle Montariol
F. Weissgerber
Sylvain Lobry
D. Tuia
151
0
0
18 Aug 2025
BERT-VQA: Visual Question Answering on Plots
Tai Vu
Robert Yang
84
1
0
14 Aug 2025
Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
Elman Ghazaei
Erchan Aptoula
200
0
0
12 Aug 2025
Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
Haifeng Li
Wang Guo
Haiyang Wu
Mengwei Wu
Jipeng Zhang
Qing Zhu
Yu Liu
Xin Huang
Chao Tao
141
1
0
09 Aug 2025
A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection
Information Processing & Management (IPM), 2025
Mohammad Zia Ur Rehman
Sufyaan Zahoor
Areeb Manzoor
Musharaf Maqbool
Nagendra Kumar
120
20
0
07 Aug 2025
DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition
Haijing Liu
Tao Pu
Hefeng Wu
Keze Wang
Guanbin Li
ObjD VLM
141
1
0
07 Aug 2025
SaviorRec: Semantic-Behavior Alignment for Cold-Start Recommendation
Yining Yao
Ziwei Li
Shuwen Xiao
Boya Du
J. Zhu
Junjun Zheng
Xiangheng Kong
Yuning Jiang
LLMSV
188
0
0
02 Aug 2025
Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation
Sobhan Asasi
Mohamed Ilyas Lakhal
Ozge Mercanoglu Sincan
Richard Bowden
SLR
202
1
0
31 Jul 2025
Closing the Modality Gap for Mixed Modality Search
Binxu Li
Yuhui Zhang
Xiaohan Wang
Weixin Liang
Ludwig Schmidt
Serena Yeung-Levy
VLM
133
4
0
25 Jul 2025
A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task
Mashiro Toyooka
Kiyoharu Aizawa
Yoko Yamakata
131
0
0
23 Jul 2025
What if Othello-Playing Language Models Could See?
Xinyi Chen
Yifei Yuan
Jiaang Li
Serge J. Belongie
Maarten de Rijke
Anders Søgaard
LRM
158
0
0
19 Jul 2025
Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu
Dinh-Thang Duong
Truong-Binh Duong
Anh-Khoi Nguyen
Thanh-Huy Nguyen
...
Jianhua Xing
Xingjian Li
Tianyang Wang
Ulas Bagci
Min Xu
VLM
280
2
0
16 Jul 2025
Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Wenhao Li
Xiu Su
Jingyi Wu
Feng Yang
Yang-Yang Liu
Yi-Ling Chen
Shan You
Chang Xu
VLM
232
0
0
07 Jul 2025