1908.08530
VL-BERT: Pre-training of Generic Visual-Linguistic Representations
International Conference on Learning Representations (ICLR), 2019
22 August 2019
Weijie Su
Xizhou Zhu
Yue Cao
Bin Li
Lewei Lu
Furu Wei
Jifeng Dai
VLM
MLLM
SSL
Github (740★)
Papers citing
"VL-BERT: Pre-training of Generic Visual-Linguistic Representations"
50 / 1,047 papers shown
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng
Kai Han
MLLM
VPVLM
VLM
349
3
0
27 Nov 2025
TOFA: Training-Free One-Shot Federated Adaptation for Vision-Language Models
Li Zhang
Zhongxuan Han
Xiaohua Feng
Jiaming Zhang
Yuyuan Li
Linbo Jiang
Jianan Lin
Chaochao Chen
FedML
VLM
504
1
0
20 Nov 2025
DiffPixelFormer: Differential Pixel-Aware Transformer for RGB-D Indoor Scene Segmentation
Yan Gong
J. Lu
Yongsheng Gao
Jie Zhao
X. Zhang
Susanto Rahardja
158
0
0
17 Nov 2025
A Retrospect to Multi-prompt Learning across Vision and Language
IEEE International Conference on Computer Vision (ICCV), 2023
Ziliang Chen
Xin Huang
Quanlong Guan
Liang Lin
Weiqi Luo
VPVLM
VLM
474
12
0
31 Oct 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
239
9
0
31 Oct 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
348
0
0
30 Oct 2025
Structure-Aware Fusion with Progressive Injection for Multimodal Molecular Representation Learning
Zihao Jing
Yan Sun
Yan Yi Li
Sugitha Janarthanan
Alana Deng
Pingzhao Hu
145
2
0
24 Oct 2025
Modest-Align: Data-Efficient Alignment for Vision-Language Models
Jiaxiang Liu
Yuan Wang
Jiawei Du
Joey Tianyi Zhou
Mingkun Xu
Zuozhu Liu
VLM
171
0
0
24 Oct 2025
ELMM: Efficient Lightweight Multimodal Large Language Models for Multimodal Knowledge Graph Completion
Wei Huang
Peining Li
Meiyu Liang
Xu Hou
Junping Du
Yingxia Shao
Guanhua Ye
Wu Liu
Kangkang Lu
Yang Yu
VLM
314
1
0
19 Oct 2025
FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Zhen Sun
Lei Tan
Yunhang Shen
Chengmao Cai
Xing Sun
Pingyang Dai
Liujuan Cao
Rongrong Ji
125
3
0
17 Oct 2025
Vision-Centric Activation and Coordination for Multimodal Large Language Models
Yunnan Wang
Fan Lu
Kecheng Zheng
Ziyuan Huang
Ziqiang Li
Wenjun Zeng
Xin Jin
MLLM
424
1
0
16 Oct 2025
CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
Fengling Zhu
Boshi Liu
Jingyu Hua
Sheng Zhong
DiffM
AAML
186
0
0
13 Oct 2025
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang
Jieyu Zhang
Taoyang Jia
Chenhao Zheng
Ziqi Gao
J. S. Park
Winson Han
Ranjay Krishna
292
0
0
10 Oct 2025
Cluster-Aware Prompt Ensemble Learning for Few-Shot Vision-Language Model Adaptation
Pattern Recognition (Pattern Recogn.), 2025
Zhi Chen
Xin Yu
Xiaohui Tao
Yan Li
Zi Huang
VLM
236
12
0
10 Oct 2025
Multilingual Vision-Language Models, A Survey
Andrei-Alexandru Manea
Jindřich Libovický
VLM
213
1
0
26 Sep 2025
Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Zhifei Li
Feng Qiu
Yiran Wang
Yujing Xia
Kui Xiao
Miao Zhang
Yan Zhang
243
0
0
25 Sep 2025
Copycat vs. Original: Multi-modal Pretraining and Variable Importance in Box-office Prediction
Qin Chao
Eunsoo Kim
Boyang Albert Li
185
0
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
163
4
0
17 Sep 2025
Data Leakage in Visual Datasets
Patrick Ramos
Ryan Ramos
Noa Garcia
PILM
297
3
0
24 Aug 2025
Checkmate: interpretable and explainable RSVQA is the endgame
Lucrezia Tosato
Christel Chappuis
Syrielle Montariol
F. Weissgerber
Sylvain Lobry
D. Tuia
231
1
0
18 Aug 2025
A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering
Chenliang Zhang
Lin Wang
Yuanyuan Lu
Yusheng Qi
Kexin Wang
P. Hou
Wenshi Chen
RALM
218
1
0
14 Aug 2025
Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
Haifeng Li
Wang Guo
Haiyang Wu
Mengwei Wu
Jipeng Zhang
Qing Zhu
Yu Liu
Xin Huang
Chao Tao
218
2
0
09 Aug 2025
ModalFormer: Multimodal Transformer for Low-Light Image Enhancement
Alexandru Brateanu
Raul Balmez
Ciprian Orhei
C. Ancuti
Cosmin Ancuti
ViT
OffRL
282
0
0
27 Jul 2025
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
Zhangyang Qi
Zhixiong Zhang
Yizhou Yu
Jiaqi Wang
Hengshuang Zhao
LM&Ro
AI4TS
485
46
0
20 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
333
1
0
13 Jun 2025
Biases Propagate in Encoder-based Vision-Language Models: A Systematic Analysis From Intrinsic Measures to Zero-shot Retrieval Outcomes
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Kshitish Ghate
Tessa E. S. Charlesworth
Mona Diab
Aylin Caliskan
VLM
221
3
0
06 Jun 2025
OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2025
Jiewen Hu
Leena Mathur
Paul Pu Liang
Louis-Philippe Morency
CVBM
233
2
0
03 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
389
3
0
02 Jun 2025
TS-VLM: Text-Guided SoftSort Pooling for Vision-Language Models in Multi-View Driving Reasoning
Lihong Chen
Hossein Hassani
Soodeh Nikan
VLM
369
4
0
19 May 2025
A Light and Smart Wearable Platform with Multimodal Foundation Model for Enhanced Spatial Reasoning in People with Blindness and Low Vision
Alexey Magay
Dhurba Tripathi
Yu Hao
Yi Fang
318
2
0
16 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Shibin Mei
Hang Wang
Bingbing Ni
353
0
0
16 May 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
330
2
0
27 Apr 2025
Securing the Skies: A Comprehensive Survey on Anti-UAV Methods, Benchmarking, and Future Directions
Yifei Dong
Fengyi Wu
Sanjian Zhang
Guangyu Chen
Yuzhi Hu
...
Yuxuan Zhou
Siyu Huang
Feng Liu
Jingdong Sun
Zhi-Qi Cheng
615
17
0
16 Apr 2025
HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
Xiwen Li
Ross T. Whitaker
Tolga Tasdizen
424
0
0
15 Apr 2025
DiffusionCom: Structure-Aware Multimodal Diffusion Model for Multimodal Knowledge Graph Completion
Wei Huang
M. Liang
Peining Li
Xu Hou
Yawen Li
Junping Du
Zhe Xue
Zeli Guan
DiffM
313
0
0
09 Apr 2025
Group-based Distinctive Image Captioning with Memory Difference Encoding and Attention
International Journal of Computer Vision (IJCV), 2024
Jiuniu Wang
Wenjia Xu
Qingzhong Wang
Antoni B. Chan
500
3
0
03 Apr 2025
UFM: Unified Feature Matching Pre-training with Multi-Modal Image Assistants
PLoS ONE (PLoS ONE), 2025
Yide Di
Yun Liao
Hao Zhou
Kaijun Zhu
Qing Duan
Junhui Liu
Mingyu Lu
212
3
0
26 Mar 2025
MAGIC-VQA: Multimodal And Grounded Inference with Commonsense Knowledge for Visual Question Answering
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shuo Yang
Siwen Luo
S. Han
Eduard Hovy
LRM
515
19
0
24 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Computer Vision and Pattern Recognition (CVPR), 2025
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
387
9
0
21 Mar 2025
DynRsl-VLM: Enhancing Autonomous Driving Perception with Dynamic Resolution Vision-Language Models
Xirui Zhou
Lianlei Shan
Xiaolin Gui
285
21
0
14 Mar 2025
Anatomy-Aware Conditional Image-Text Retrieval
Meng Zheng
Jiajin Zhang
Benjamin Planche
Zhongpai Gao
Terrence Chen
Ziyan Wu
MedIm
290
0
0
10 Mar 2025
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Computer Vision and Pattern Recognition (CVPR), 2025
Ziyang Zhang
Yang Yu
Yucheng Chen
Xulei Yang
S. Yeo
MedIm
613
12
0
02 Mar 2025
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
S M Sarwar
586
3
0
25 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
IEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
407
3
0
11 Feb 2025
Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
Lin Chen
Qi Yang
Kun Ding
Tianying Wang
Gang Shen
Fei Li
Qiyuan Cao
Shiming Xiang
VLM
268
3
0
29 Jan 2025
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
Computer Vision and Pattern Recognition (CVPR), 2025
Miran Heo
Min-Hung Chen
De-An Huang
Sifei Liu
Subhashree Radhakrishnan
Seon Joo Kim
Yu-Chun Wang
Ryo Hachiuma
ObjD
VLM
566
14
0
14 Jan 2025
Benchmarking Large and Small MLLMs
Xuelu Feng
Yunsheng Li
DongDong Chen
Mei Gao
Mengchen Liu
Junsong Yuan
Chunming Qiao
158
6
0
04 Jan 2025
VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
Neural Information Processing Systems (NeurIPS), 2024
Jiannan Wu
Muyan Zhong
Sen Xing
Zeqiang Lai
Zhaoyang Liu
...
Lewei Lu
Tong Lu
Ping Luo
Yu Qiao
Jifeng Dai
MLLM
VLM
LRM
1.0K
145
0
03 Jan 2025
Towards Visual Grounding: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
1.1K
43
0
28 Dec 2024
Cross-Modal Few-Shot Learning with Second-Order Neural Ordinary Differential Equations
AAAI Conference on Artificial Intelligence (AAAI), 2024
Yi Zhang
Chun-Wun Cheng
Junyi He
Zhihai He
Carola-Bibiane Schonlieb
Yuyan Chen
Angelica I Aviles-Rivero
AI4TS
405
1
0
20 Dec 2024