1908.03557
Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Papers citing
"VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 1,260 papers shown
ChestGPT: Integrating Large Language Models and Vision Transformers for Disease Detection and Localization in Chest X-Rays
Shehroz S. Khan
Petar Przulj
A. Ashraf
Ali Abedi
LM&MA
MedIm
164
1
0
04 Jul 2025
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
Zhangyang Qi
Zhixiong Zhang
Yizhou Yu
Jiaqi Wang
Hengshuang Zhao
LM&Ro
AI4TS
387
28
0
20 Jun 2025
Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao
Yiwei Wang
Yujun Cai
Zhicheng YANG
Jing Tang
LLMAG
179
4
0
18 Jun 2025
Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models
Xuelin Shen
Jiayin Xu
Kangsheng Yin
Wenhan Yang
AAML
256
0
0
18 Jun 2025
Segmenting Visuals With Querying Words: Language Anchors For Semi-Supervised Image Segmentation
Numair Nadeem
Saeed Anwar
Muhammad Asad
Abdul Bais
VLM
303
0
0
16 Jun 2025
RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer
Haotian Ni
Yake Wei
Hang Liu
Gong Chen
Chong Peng
Hao Lin
Di Hu
OffRL
295
1
0
13 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
308
0
0
13 Jun 2025
Vision Generalist Model: A Survey
International Journal of Computer Vision (IJCV), 2025
Ziyi Wang
Yongming Rao
Shuofeng Sun
Xinrun Liu
Yi Wei
...
Zuyan Liu
Yanbo Wang
Hongmin Liu
Jie Zhou
Jiwen Lu
305
0
0
11 Jun 2025
Multimodal Representation Alignment for Cross-modal Information Retrieval
Fan Xu
Luis A. Leiva
224
1
0
10 Jun 2025
OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2025
Jiewen Hu
Leena Mathur
Paul Pu Liang
Louis-Philippe Morency
CVBM
198
1
0
03 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
304
1
0
02 Jun 2025
What's Missing in Vision-Language Models? Probing Their Struggles with Causal Order Reasoning
Zhaotian Weng
Haoxuan Li
Kuan-Hao Huang
Jieyu Zhao
LRM
CoGe
201
0
0
01 Jun 2025
FinMME: Benchmark Dataset for Financial Multi-Modal Reasoning Evaluation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junyu Luo
Zhizhuo Kou
Liming Yang
Xiao Luo
Jinsheng Huang
...
Jiaming Ji
Xuanzhe Liu
Sirui Han
Ming Zhang
Wenhan Luo
199
14
0
30 May 2025
Multi-MLLM Knowledge Distillation for Out-of-Context News Detection
Yimeng Gu
Zhao Tong
Ignacio Castro
Shu Wu
Gareth Tyson
173
3
0
28 May 2025
LifeIR at the NTCIR-18 Lifelog-6 Task
NTCIR Conference on Evaluation of Information Access Technologies (NTCIR), 2025
Jiahan Chen
Da Li
Keping Bi
172
1
0
27 May 2025
Embodied AI with Foundation Models for Mobile Service Robots: A Systematic Review
Matthew Lisondra
B. Benhabib
G. Nejat
LM&Ro
282
2
0
26 May 2025
Multi-modal brain encoding models for multi-modal stimuli
International Conference on Learning Representations (ICLR), 2025
R. Mamidi
Khushbu Pahwa
Mounika Marreddy
Maneesh Singh
Subba Reddy Oota
Bapi S. Raju
190
9
0
26 May 2025
Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
IEEE Transactions on Artificial Intelligence (IEEE TAI), 2025
Md. Mithun Hossain
Md. Shakil Hossain
Sudipto Chaki
M. F. Mridha
457
0
0
25 May 2025
Visual Question Answering on Multiple Remote Sensing Image Modalities
Hichem Boussaid
Lucrezia Tosato
F. Weissgerber
Camille Kurtz
Laurent Wendling
Sylvain Lobry
177
6
0
21 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
378
3
0
20 May 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Lihong Chen
Hossein Hassani
Soodeh Nikan
VLM
330
4
0
19 May 2025
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui
Cong Ma
Zongming Ma
SSL
334
2
0
18 May 2025
Hyperspectral Image Land Cover Captioning Dataset for Vision Language Models
Aryan Das
Tanishq Rachamalla
Pravendra Singh
Koushik Biswas
Vinay Kumar Verma
Swalpa Kumar Roy
VLM
242
2
0
18 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Shibin Mei
Hang Wang
Bingbing Ni
317
0
0
16 May 2025
Exploring Implicit Visual Misunderstandings in Multimodal Large Language Models through Attention Analysis
Pengfei Wang
Guohai Xu
Weinong Wang
Junjie Yang
Jie Lou
Yunhua Xue
343
2
0
15 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Yuan Yao
Tong Zhang
Heng Ji
VLM
361
1
0
13 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Conference on Uncertainty in Artificial Intelligence (UAI), 2025
Aishwarya Venkataramanan
P. Bodesheim
Joachim Denzler
BDL
VLM
415
2
0
08 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
1.2K
32
0
05 May 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
366
2
0
30 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
501
14
0
29 Apr 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
273
2
0
27 Apr 2025
Multimodal graph representation learning for website generation based on visual sketch
Tung D. Vu
Chung Hoang
Truong-Son Hy
3DV
308
1
0
25 Apr 2025
ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
Shuanglin Yan
Neng Dong
Shuang Li
Rui Yan
Hao Tang
Jing Qin
1.0K
0
0
25 Apr 2025
A Genealogy of Foundation Models in Remote Sensing
Kevin Lane
Morteza Karimzadeh
367
1
0
24 Apr 2025
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
International Conference on Conceptual Structures (ICCS), 2025
Ali Anaissi
Junaid Akram
Kunal Chaturvedi
Ali Braytee
261
3
0
23 Apr 2025
FrogDogNet: Fourier frequency Retained visual prompt Output Guidance for Domain Generalization of CLIP in Remote Sensing
Hariseetharam Gunduboina
Muhammad Haris Khan
Biplab Banerjee
VLM
296
2
0
23 Apr 2025
OmniV-Med: Scaling Medical Vision-Language Model for Universal Visual Understanding
Songtao Jiang
Yuan Wang
Sibo Song
Yanzhe Zhang
Zijie Meng
Bohan Lei
Jian Wu
Jimeng Sun
Zuozhu Liu
MedIm
VLM
260
11
0
20 Apr 2025
HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
Xiwen Li
Ross T. Whitaker
Tolga Tasdizen
293
0
0
15 Apr 2025
TSAL: Few-shot Text Segmentation Based on Attribute Learning
Chenming Li
Chengxu Liu
Yuanting Fan
Xiao Jin
Xingsong Hou
Xueming Qian
VLM
341
0
0
15 Apr 2025
Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging
International Journal of Machine Learning and Cybernetics (IJMLC), 2025
Siyuan Dai
Kai Ye
Guodong Liu
Haoteng Tang
Chen Tang
MedIm
226
5
0
09 Apr 2025
DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion
Wei Huang
M. Liang
Peining Li
Xu Hou
Yawen Li
Junping Du
Zhe Xue
Zeli Guan
DiffM
276
0
0
09 Apr 2025
A Lightweight Large Vision-language Model for Multimodal Medical Images
Belal Alsinglawi
Chris McCarthy
Sara Webb
Christopher Fluke
Navid Toosy Saidy
LM&MA
264
0
0
08 Apr 2025
SynWorld: Virtual Scenario Synthesis for Agentic Action Knowledge Refinement
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Runnan Fang
Xiaobin Wang
Yuan Liang
Shuofei Qiao
Jialong Wu
...
Ningyu Zhang
Yong Jiang
Pengjun Xie
Fei Huang
Zeyang Zhang
LLMAG
476
3
0
04 Apr 2025
ANNEXE: Unified Analyzing, Answering, and Pixel Grounding for Egocentric Interaction
Computer Vision and Pattern Recognition (CVPR), 2025
Yuejiao Su
Yi Wang
Qiongyang Hu
Chuang Yang
Lap-Pui Chau
268
4
0
02 Apr 2025
Exploring Hallucination of Large Multimodal Models in Video Understanding: Benchmark, Analysis and Mitigation
Hongcheng Gao
Jiashu Qu
Jingyi Tang
Baolong Bi
Yi Liu
Hongyu Chen
Li Liang
Li Su
Qingming Huang
MLLM
VLM
LRM
438
13
0
25 Mar 2025
FedMM-X: A Trustworthy and Interpretable Framework for Federated Multi-Modal Learning in Dynamic Environments
Sree Bhargavi Balija
FedML
192
4
0
25 Mar 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
505
3
0
23 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Computer Vision and Pattern Recognition (CVPR), 2025
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
311
5
0
21 Mar 2025
A Survey on fMRI-based Brain Decoding for Reconstructing Multimodal Stimuli
Pengyu Liu
Guohua Dong
D. Guo
Kun Li
Fengling Li
Xun Yang
Meng Wang
Xiaomin Ying
AI4CE
268
5
0
20 Mar 2025
FusDreamer: Label-efficient Remote Sensing World Model for Multimodal Data Classification
IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2025
Jiadong Wang
Weiwei Song
Hao Chen
Jie Ren
Huimin Zhao
356
3
0
18 Mar 2025