VisualBERT: A Simple and Performant Baseline for Vision and Language

9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
    VLM

Papers citing "VisualBERT: A Simple and Performant Baseline for Vision and Language"

50 / 1,256 papers shown
SurgAnt-ViVQA: Learning to Anticipate Surgical Events through GRU-Driven Temporal Cross-Attention
Shreyas C. Dhake
Jiayuan Huang
Runlong He
Danyal Z. Khan
E. Mazomenos
Sophia Bano
Hani J. Marcus
Danail Stoyanov
Matthew J. Clarkson
Mobarak I. Hoque
32
0
0
05 Nov 2025
Generating Accurate and Detailed Captions for High-Resolution Images
Hankyeol Lee
Gawon Seo
Kyounggyu Lee
Dogun Kim
Kyungwoo Song
Jiyoung Jung
MLLM VLM
165
0
0
31 Oct 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
100
0
0
31 Oct 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
179
0
0
30 Oct 2025
MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models
Suchan Lee
Jihoon Choi
Sohyeon Lee
Minseok Song
Bong-Gyu Jang
Hwanjo Yu
S. Han
AI4TS
88
0
0
27 Oct 2025
HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models
Erum Mushtaq
Zalan Fabian
Yavuz Faruk Bakman
Anil Ramakrishna
Mahdi Soltanolkotabi
Salman Avestimehr
89
2
0
25 Oct 2025
Top-Down Semantic Refinement for Image Captioning
Jusheng Zhang
Kaitong Cai
Jing Yang
Jian Wang
Chengpei Tang
Keze Wang
DiffM MLLM BDL
242
1
0
25 Oct 2025
Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Cheng Huang
Nyima Tashi
Fan Gao
Yutong Liu
J. Li
...
Guojie Tang
Xiangxiang Wang
Jia Zhang
Tsengdar J. Lee
Yongbin Yu
72
0
0
22 Oct 2025
Does Visual Grounding Enhance the Understanding of Embodied Knowledge in Large Language Models?
Zhihui Yang
Yupei Wang
Kaijie Mo
Zhe Zhao
Renfen Hu
94
0
0
19 Oct 2025
ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
Wei Huang
Peining Li
Meiyu Liang
Xu Hou
Junping Du
Yingxia Shao
Guanhua Ye
Wu Liu
Kangkang Lu
Yang Yu
VLM
68
0
0
19 Oct 2025
On the Provable Importance of Gradients for Language-Assisted Image Clustering
Bo Peng
Jie Lu
G. Zhang
Zhen Fang
VLM
103
0
0
18 Oct 2025
A Multimodal Approach to Heritage Preservation in the Context of Climate Change
David Roqui
Adèle Cormier
Nistor Grozavu
Ann Bourges
56
0
0
15 Oct 2025
CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
Fengling Zhu
Boshi Liu
Jingyu Hua
Sheng Zhong
DiffM AAML
90
0
0
13 Oct 2025
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng
Guangyi Chen
Tianpei Gu
Lingjing Kong
Yan Li
Zeyu Tang
Kun Zhang
104
1
0
12 Oct 2025
Cooperative Pseudo Labeling for Unsupervised Federated Classification
Kuangpu Guo
Lijun Sheng
Yongcan Yu
Jian Liang
Zilei Wang
Ran He
FedML VLM
100
0
0
11 Oct 2025
Unpacking Hateful Memes: Presupposed Context and False Claims
Weibin Cai
Jiayu Li
R. Zafarani
49
0
0
11 Oct 2025
Learning from Mistakes: Enhancing Harmful Meme Detection via Misjudgment Risk Patterns
Wenshuo Wang
Ziyou Jiang
Junjie Wang
Mingyang Li
Jie Huang
Yuekai Huang
Zhiyuan Chang
Feiyan Duan
Qing Wang
79
0
0
10 Oct 2025
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang
Jieyu Zhang
Taoyang Jia
Chenhao Zheng
Ziqi Gao
J. S. Park
Winson Han
Ranjay Krishna
125
0
0
10 Oct 2025
Mysteries of the Deep: Role of Intermediate Representations in Out of Distribution Detection
I. M. De la Jara
C. Rodriguez-Opazo
D. Teney
D. Ranasinghe
E. Abbasnejad
OODD
267
0
0
07 Oct 2025
Spatial-ViLT: Enhancing Visual Spatial Reasoning through Multi-Task Learning
Chashi Mahiul Islam
Oteo Mamo
Samuel Jacob Chacko
Xiuwen Liu
Weikuan Yu
LRM
80
0
0
03 Oct 2025
Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving
Shunfeng Zheng
Yudi Zhang
Meng Fang
Zihan Zhang
Zhitan Wu
Mykola Pechenizkiy
Ling-Hao Chen
ReLM RALM LRM
176
0
0
01 Oct 2025
Multimodal Large Language Models Meet Multimodal Emotion Recognition and Reasoning: A Survey
Yuntao Shou
Tao Meng
Wei Ai
Keqin Li
LRM
130
3
0
29 Sep 2025
Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea
Jindřich Libovický
VLM
95
1
0
26 Sep 2025
Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction
Lei Hei
Tingjing Liao
Yingxin Pei
Yiyang Qi
Jiaqi Wang
Ruiting Li
Feiliang Ren
96
0
0
25 Sep 2025
Audio-Visual Separation with Hierarchical Fusion and Representation Alignment
Han Hu
Dongheng Lin
Qiming Huang
Yuqi Hou
Hyung Jin Chang
Jianbo Jiao
44
0
0
24 Sep 2025
Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Dapeng Zhang
Jin Sun
Chenghui Hu
Xiaoyan Wu
Zhenlong Yuan
R. Zhou
Fei Shen
Qingguo Zhou
LM&Ro
193
13
0
23 Sep 2025
Leveraging NTPs for Efficient Hallucination Detection in VLMs
Ofir Azachi
Kfir Eliyahu
Eyal El Ani
Rom Himelstein
Roi Reichart
Yuval Pinter
Nitay Calderon
VLM
97
0
0
20 Sep 2025
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Saeed Amizadeh
Sara Abdali
Yinheng Li
K. Koishida
92
0
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
83
1
0
17 Sep 2025
DepthVision: Enabling Robust Vision-Language Models with GAN-Based LiDAR-to-RGB Synthesis for Autonomous Driving
Sven Kirchner
Nils Purschke
Ross Greer
Alois C. Knoll
3DV VLM
92
0
0
09 Sep 2025
Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
Hiroshi Sasaki
VLM
68
0
0
02 Sep 2025
NLKI: A lightweight Natural Language Knowledge Integration Framework for Improving Small VLMs in Commonsense VQA Tasks
Aritra Dutta
Swapnanil Mukherjee
Deepanway Ghosal
Somak Aditya
VLM
35
0
0
27 Aug 2025
JVLGS: Joint Vision-Language Gas Leak Segmentation
Xinlong Zhao
Qixiang Pang
Shan Du
56
0
0
27 Aug 2025
A Multimodal-Multitask Framework with Cross-modal Relation and Hierarchical Interactive Attention for Semantic Comprehension
Information Fusion (Inf. Fusion), 2025
Mohammad Zia Ur Rehman
Devraj Raghuvanshi
Umang Jain
Shubhi Bansal
Nagendra Kumar
72
5
0
22 Aug 2025
Checkmate: interpretable and explainable RSVQA is the endgame
Lucrezia Tosato
Christel Chappuis
Syrielle Montariol
F. Weissgerber
Sylvain Lobry
D. Tuia
84
0
0
18 Aug 2025
BERT-VQA: Visual Question Answering on Plots
Tai Vu
Robert Yang
52
1
0
14 Aug 2025
Text-conditioned State Space Model For Domain-generalized Change Detection Visual Question Answering
Elman Ghazaei
Erchan Aptoula
96
0
0
12 Aug 2025
Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
Haifeng Li
Wang Guo
Haiyang Wu
Mengwei Wu
Jipeng Zhang
Qing Zhu
Yu Liu
Xin Huang
Chao Tao
98
0
0
09 Aug 2025
DART: Dual Adaptive Refinement Transfer for Open-Vocabulary Multi-Label Recognition
Haijing Liu
Tao Pu
Hefeng Wu
Keze Wang
Liang Lin
ObjD VLM
78
0
0
07 Aug 2025
A Context-aware Attention and Graph Neural Network-based Multimodal Framework for Misogyny Detection
Information Processing & Management (IPM), 2025
Mohammad Zia Ur Rehman
Sufyaan Zahoor
Areeb Manzoor
Musharaf Maqbool
Nagendra Kumar
68
19
0
07 Aug 2025
SaviorRec: Semantic-Behavior Alignment for Cold-Start Recommendation
Yining Yao
Ziwei Li
Shuwen Xiao
Boya Du
J. Zhu
Junjun Zheng
Xiangheng Kong
Yuning Jiang
LLMSV
124
0
0
02 Aug 2025
Beyond Gloss: A Hand-Centric Framework for Gloss-Free Sign Language Translation
Sobhan Asasi
Mohamed Ilyas Lakhal
Ozge Mercanoglu Sincan
Richard Bowden
SLR
162
0
0
31 Jul 2025
Closing the Modality Gap for Mixed Modality Search
Binxu Li
Yuhui Zhang
Xiaohan Wang
Weixin Liang
Ludwig Schmidt
Serena Yeung-Levy
VLM
88
4
0
25 Jul 2025
A Highly Clean Recipe Dataset with Ingredient States Annotation for State Probing Task
Mashiro Toyooka
Kiyoharu Aizawa
Yoko Yamakata
88
0
0
23 Jul 2025
What if Othello-Playing Language Models Could See?
Xinyi Chen
Yifei Yuan
Jiaang Li
Serge J. Belongie
Maarten de Rijke
Anders Søgaard
LRM
91
0
0
19 Jul 2025
Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu
Dinh-Thang Duong
Truong-Binh Duong
Anh-Khoi Nguyen
Thanh-Huy Nguyen
...
Jianhua Xing
Xingjian Li
Tianyang Wang
Ulas Bagci
Min Xu
VLM
223
2
0
16 Jul 2025
Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Wenhao Li
Xiu Su
Jingyi Wu
Feng Yang
Yang-Yang Liu
Yi-Ling Chen
Shan You
Chang Xu
VLM
135
0
0
07 Jul 2025
ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays
Shehroz S. Khan
Petar Przulj
A. Ashraf
Ali Abedi
LM&MA MedIm
58
1
0
04 Jul 2025
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
Zhangyang Qi
Zhixiong Zhang
Yizhou Yu
Jiaqi Wang
Hengshuang Zhao
LM&Ro AI4TS
271
23
0
20 Jun 2025
Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao
Yiwei Wang
Yujun Cai
Zhicheng Yang
Jing Tang
LLMAG
118
4
0
18 Jun 2025