1908.03557
Cited By
VisualBERT: A Simple and Performant Baseline for Vision and Language
9 August 2019
Liunian Harold Li
Mark Yatskar
Da Yin
Cho-Jui Hsieh
Kai-Wei Chang
VLM
Papers citing
"VisualBERT: A Simple and Performant Baseline for Vision and Language"
50 / 297 papers shown
Title
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
62
0
0
05 May 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
81
0
0
30 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
84
0
0
29 Apr 2025
Multimodal graph representation learning for website generation based on visual sketch
Tung D. Vu
Chung Hoang
Truong-Son Hy
3DV
51
0
0
25 Apr 2025
ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
Shuanglin Yan
Neng Dong
Shuang Li
Rui Yan
Hao Tang
Jing Qin
83
0
0
25 Apr 2025
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
Ali Anaissi
Junaid Akram
Kunal Chaturvedi
Ali Braytee
22
0
0
23 Apr 2025
FlowTok: Flowing Seamlessly Across Text and Image Tokens
Ju He
Qihang Yu
Qihao Liu
Liang-Chieh Chen
66
0
0
13 Mar 2025
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
S M Sarwar
66
1
0
25 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
62
3
0
11 Feb 2025
A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions
Elisa Negrini
Yuxuan Liu
Liu Yang
Stanley Osher
Hayden Schaeffer
AI4CE
86
0
0
09 Feb 2025
Mitigating GenAI-powered Evidence Pollution for Out-of-Context Multimodal Misinformation Detection
Zehong Yan
Peng Qi
W. Hsu
M. Lee
39
0
0
24 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
30
0
0
20 Jan 2025
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
66
2
0
20 Nov 2024
Prompt-enhanced Network for Hateful Meme Classification
Junxi Liu
Yanyan Feng
Jiehai Chen
Yun Xue
Fenghuan Li
VLM
53
0
0
12 Nov 2024
Robust 3D Point Clouds Classification based on Declarative Defenders
Kaidong Li
Tianxiao Zhang
Cuncong Zhong
Z. Zhang
G. Wang
3DPC
34
1
0
13 Oct 2024
Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Dianzhi Yu
Xinni Zhang
Yankai Chen
Aiwei Liu
Yifei Zhang
Philip S. Yu
Irwin King
VLM
CLL
39
9
0
07 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
30
0
0
01 Oct 2024
MIO: A Foundation Model on Multimodal Tokens
Zekun Wang
King Zhu
Chunpu Xu
Wangchunshu Zhou
Jiaheng Liu
...
Yuanxing Zhang
Ge Zhang
Ke Xu
Jie Fu
Wenhao Huang
MLLM
AuLLM
51
11
0
26 Sep 2024
Embodiment-Agnostic Action Planning via Object-Part Scene Flow
Weiliang Tang
Jia-Hui Pan
Wei Zhan
Jianshu Zhou
Huaxiu Yao
Yun-Hui Liu
M. Tomizuka
Mingyu Ding
Chi-Wing Fu
41
0
0
16 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe
VLM
30
0
0
12 Sep 2024
TG-LMM: Enhancing Medical Image Segmentation Accuracy through Text-Guided Large Multi-Modal Model
Yihao Zhao
Enhao Zhong
Cuiyun Yuan
Yang Li
Man Zhao
Chunxia Li
Jun Hu
Chenbin Liu
VLM
MedIm
36
0
0
05 Sep 2024
Towards Zero-Shot Multimodal Machine Translation
Matthieu Futeral
Cordelia Schmid
Benoît Sagot
Rachel Bawden
30
3
0
18 Jul 2024
Elevating All Zero-Shot Sketch-Based Image Retrieval Through Multimodal Prompt Learning
Mainak Singha
Ankit Jha
Divyam Gupta
Pranav Singla
Biplab Banerjee
VLM
32
0
0
05 Jul 2024
ARES: Alternating Reinforcement Learning and Supervised Fine-Tuning for Enhanced Multi-Modal Chain-of-Thought Reasoning Through Diverse AI Feedback
Ju-Seung Byun
Jiyun Chun
Jihyung Kil
Andrew Perrault
ReLM
LRM
27
1
0
25 Jun 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
31
1
0
24 Jun 2024
Learning Domain-Invariant Features for Out-of-Context News Detection
Yimeng Gu
Mengqi Zhang
Ignacio Castro
Shu Wu
Gareth Tyson
35
2
0
11 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Shu-Tao Xia
Ke Xu
AAML
VLM
30
13
0
08 Jun 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
30
13
0
30 May 2024
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Y. Zou
Kai-Wei Chang
Wei Wang
SyDa
VLM
LRM
36
36
0
30 May 2024
MetaToken: Detecting Hallucination in Image Descriptions by Meta Classification
Laura Fieback
Jakob Spiegelberg
Hanno Gottschalk
MLLM
57
5
0
29 May 2024
FinEmbedDiff: A Cost-Effective Approach of Classifying Financial Documents with Vector Sampling using Multi-modal Embedding Models
Anjanava Biswas
Wrick Talukdar
16
1
0
28 May 2024
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
N. Kalibhat
Priyatham Kattakinda
Arman Zarei
Nikita Seleznev
Sam Sharpe
Senthil Kumar
S. Feizi
ViT
34
0
0
26 May 2024
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Xiyao Wang
Jiuhai Chen
Zhaoyang Wang
Yuhang Zhou
Yiyang Zhou
...
Tianyi Zhou
Tom Goldstein
Parminder Bhatia
Furong Huang
Cao Xiao
60
33
0
24 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
67
41
0
23 May 2024
STAR: A Benchmark for Situated Reasoning in Real-World Videos
Bo Wu
Shoubin Yu
Zhenfang Chen
Joshua B Tenenbaum
Chuang Gan
33
176
0
15 May 2024
Realizing Visual Question Answering for Education: GPT-4V as a Multimodal AI
Gyeong-Geon Lee
Xiaoming Zhai
29
4
0
12 May 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
35
1
0
19 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
32
4
0
18 Apr 2024
AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation
Jiannan Ge
Lingxi Xie
Hongtao Xie
Pandeng Li
Xiaopeng Zhang
Yongdong Zhang
Qi Tian
VLM
16
3
0
08 Apr 2024
Contextual Chart Generation for Cyber Deception
David D. Nguyen
David Liebowitz
Surya Nepal
S. Kanhere
Sharif Abuadbba
41
0
0
07 Apr 2024
VideoDistill: Language-aware Vision Distillation for Video Question Answering
Bo Zou
Chao Yang
Yu Qiao
Chengbin Quan
Youjian Zhao
VGen
39
1
0
01 Apr 2024
Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization
Mainak Singha
Ankit Jha
Shirsha Bose
Ashwin Nair
Moloud Abdar
Biplab Banerjee
VLM
35
10
0
31 Mar 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
21
0
0
29 Mar 2024
Surgical-LVLM: Learning to Adapt Large Vision-Language Model for Grounded Visual Question Answering in Robotic Surgery
Guan-Feng Wang
Long Bai
Wan Jun Nah
Jie Wang
Zhaoxi Zhang
Zhen Chen
Jinlin Wu
Mobarakol Islam
Hongbin Liu
Hongliang Ren
40
14
0
22 Mar 2024
Grounding Spatial Relations in Text-Only Language Models
Gorka Azkune
Ander Salaberria
Eneko Agirre
34
0
0
20 Mar 2024
Prioritized Semantic Learning for Zero-shot Instance Navigation
Xander Sun
Louis Lau
Hoyard Zhi
Ronghe Qiu
Junwei Liang
30
8
0
18 Mar 2024
Deciphering Hate: Identifying Hateful Memes and Their Targets
E. Hossain
Omar Sharif
M. M. Hoque
S. Preum
44
4
0
16 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
54
3
0
15 Mar 2024
Efficient Prompt Tuning of Large Vision-Language Model for Fine-Grained Ship Classification
Long Lan
Fengxiang Wang
Shuyan Li
Xiangtao Zheng
Zengmao Wang
Xinwang Liu
VLM
24
7
0
13 Mar 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
26
2
0
27 Feb 2024