Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
Jize Cao, Zhe Gan, Yu Cheng, Licheng Yu, Yen-Chun Chen, Jingjing Liu · 15 May 2020 · arXiv:2005.07310 (v2) · VLM
Papers citing "Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models" (50 of 59 shown)

An Empirical Study on How Video-LLMs Answer Video Questions · Chenhui Gou, Ziyu Ma, Zicheng Duan, Haoyu He, Feng Chen, Akide Liu, Bohan Zhuang, Jianfei Cai, H. Rezatofighi · 21 Aug 2025
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs · Yaniv Nikankin, Dana Arad, Yossi Gandelsman, Yonatan Belinkov · 10 Jun 2025
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering · Chengyue Huang, Brisa Maneechotesuwan, Shivang Chopra, Z. Kira · 27 May 2025 · AAML
TerraMind: Large-Scale Generative Multimodality for Earth Observation · Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, ..., P. Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabé-Moreno, Nicolas Longépé · 15 Apr 2025 · MLLM, VLM
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens · Zhangqi Jiang, Junkai Chen, Beier Zhu, Tingjin Luo, Yankun Shen, Xu Yang · 23 Nov 2024
Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024 · Jinwoo Ahn, Junhyeok Park, Min-Jun Kim, Kang-Hyeon Kim, So-Yeong Sohn, Yun-Ji Lee, Du-Seong Chang, Yu-Jung Heo, Eun-Sol Kim · 10 Jun 2024 · LRM
INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers · Anjan Karmakar, Romain Robbes · 08 Dec 2023
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges · Maria Lymperaiou, Giorgos Stamou · 04 Mar 2023 · VLM
Controlling for Stereotypes in Multimodal Language Model Evaluation · Manuj Malik, Richard Johansson · 03 Feb 2023
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks · Letitia Parcalabescu, Anette Frank · 15 Dec 2022
A survey on knowledge-enhanced multimodal learning · Maria Lymperaiou, Giorgos Stamou · 19 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention · Fenglin Liu, Xian Wu, Shen Ge, Xuancheng Ren, Wei Fan, Xu Sun, Yuexian Zou · 28 Oct 2022 · VLM
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? · Mitja Nikolaus, Emmanuelle Salin, Stéphane Ayache, Abdellah Fourtassi, Benoit Favre · 21 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning · Tiannan Wang, Wangchunshu Zhou, Yan Zeng, Xinsong Zhang · 14 Oct 2022 · VLM
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks · Gregor Geigle, Chen Cecilia Liu, Jonas Pfeiffer, Iryna Gurevych · 12 Oct 2022 · VLM
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training · Haoxuan You, Luowei Zhou, Bin Xiao, Noel Codella, Yu Cheng, Ruochen Xu, Shih-Fu Chang, Lu Yuan · 26 Jul 2022 · CLIP, VLM
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix · Teng Wang, Wenhao Jiang, Zhichao Lu, Feng Zheng, Ran Cheng, Chengguo Yin, Ping Luo · 17 Jun 2022 · VLM
Multimodal Learning with Transformers: A Survey · Peng Xu, Xiatian Zhu, David Clifton · 13 Jun 2022 · ViT
Delving into the Openness of CLIP · Shuhuai Ren, Lei Li, Xuancheng Ren, Guangxiang Zhao, Xu Sun · 04 Jun 2022 · VLM
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models · Wangchunshu Zhou, Yan Zeng, Shizhe Diao, Xinsong Zhang · 30 May 2022 · CoGe, VLM
Visualizing and Explaining Language Models · Adrian M. P. Braşoveanu, Razvan Andonie · 30 Apr 2022 · MILM, VLM
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly · Spencer Whitehead, Suzanne Petryk, Vedaad Shakib, Joseph E. Gonzalez, Trevor Darrell, Anna Rohrbach, Marcus Rohrbach · 28 Apr 2022
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks · Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Xiyang Dai, ..., Jianwei Yang, Haoxuan You, Kai-Wei Chang, Shih-Fu Chang, Lu Yuan · 22 Apr 2022 · VLM, OffRL
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality · Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, Candace Ross · 07 Apr 2022 · CoGe
On Explaining Multimodal Hateful Meme Detection Models · Ming Shan Hee, Roy Ka-wei Lee, Wen-Haw Chong · 04 Apr 2022 · VLM
VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers · Estelle Aflalo, Meng Du, Shao-Yen Tseng, Yongfei Liu, Chenfei Wu, Nan Duan, Vasudev Lal · 30 Mar 2022
Grounding Commands for Autonomous Vehicles via Layer Fusion with Region-specific Dynamic Layer Attention · Hou Pong Chan, M. Guo, Chengguang Xu · 14 Mar 2022
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations · Yiwei Lyu, Paul Pu Liang, Zihao Deng, Ruslan Salakhutdinov, Louis-Philippe Morency · 03 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models · Feng Li, Hao Zhang, Yi-Fan Zhang, Shixuan Liu, Jian Guo, L. Ni, Pengchuan Zhang, Lei Zhang · 03 Mar 2022 · AI4TS, VLM
CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks · Zhecan Wang, Noel Codella, Yen-Chun Chen, Luowei Zhou, Jianwei Yang, Xiyang Dai, Bin Xiao, Haoxuan You, Shih-Fu Chang, Lu Yuan · 15 Jan 2022 · CLIP, VLM
VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena · Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, Albert Gatt · 14 Dec 2021 · CoGe
MLP Architectures for Vision-and-Language Modeling: An Empirical Study · Yi-Liang Nie, Linjie Li, Zhe Gan, Shuohang Wang, Chenguang Zhu, Michael Zeng, Zicheng Liu, Joey Tianyi Zhou, Lijuan Wang · 08 Dec 2021
Scaling Up Vision-Language Pre-training for Image Captioning · Xiaowei Hu, Zhe Gan, Jianfeng Wang, Zhengyuan Yang, Zicheng Liu, Yumao Lu, Lijuan Wang · 24 Nov 2021 · MLLM, VLM
TraVLR: Now You See It, Now You Don't! A Bimodal Dataset for Evaluating Visio-Linguistic Reasoning · Keng Ji Chow, Samson Tan, MingSung Kan · 21 Nov 2021 · LRM
History Aware Multimodal Transformer for Vision-and-Language Navigation · Shizhe Chen, Pierre-Louis Guhur, Cordelia Schmid, Ivan Laptev · 25 Oct 2021 · LM&Ro
A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models · Woojeong Jin, Yu Cheng, Yelong Shen, Weizhu Chen, Xiang Ren · 16 Oct 2021 · VLM, VPVLM, MLLM
Does Vision-and-Language Pretraining Improve Lexical Grounding? · Tian Yun, Chen Sun, Ellie Pavlick · 21 Sep 2021 · VLM, CoGe
Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers · Stella Frank, Emanuele Bugliarello, Desmond Elliott · 09 Sep 2021
ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration · Yuhao Cui, Zhou Yu, Chunqi Wang, Zhongzhou Zhao, Ji Zhang, Meng Wang, Jun-chen Yu · 16 Aug 2021 · VLM
OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning · Sheng Liu, Kevin Qinghong Lin, Lijuan Wang, Junsong Yuan, Zicheng Liu · 08 Aug 2021 · VLM
Exceeding the Limits of Visual-Linguistic Multi-Task Learning · Cameron R. Wolfe, Keld T. Lundgaard · 27 Jul 2021 · VLM
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training · Hongwei Xue, Yupan Huang, Bei Liu, Houwen Peng, Jianlong Fu, Houqiang Li, Jiebo Luo · 25 Jun 2021
Probing Image-Language Transformers for Verb Understanding · Lisa Anne Hendricks, Aida Nematzadeh · 16 Jun 2021
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models · Linjie Li, Jie Lei, Zhe Gan, Jingjing Liu · 01 Jun 2021 · AAML, VLM
Playing Lottery Tickets with Vision and Language · Zhe Gan, Yen-Chun Chen, Linjie Li, Tianlong Chen, Yu Cheng, Shuohang Wang, Jingjing Liu, Lijuan Wang, Zicheng Liu · 23 Apr 2021 · VLM
Effect of Visual Extensions on Natural Language Understanding in Vision-and-Language Models · Taichi Iki, Akiko Aizawa · 16 Apr 2021 · VLM
The Road to Know-Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation · Yuankai Qi, Zizheng Pan, Yicong Hong, Ming-Hsuan Yang, Anton Van Den Hengel, Qi Wu · 09 Apr 2021 · LM&Ro
How Transferable are Reasoning Patterns in VQA? · Corentin Kervadec, Theo Jaunet, G. Antipov, M. Baccouche, Romain Vuillemot, Christian Wolf · 08 Apr 2021 · LRM
Compressing Visual-linguistic Model via Knowledge Distillation · Zhiyuan Fang, Jianfeng Wang, Xiaowei Hu, Lijuan Wang, Yezhou Yang, Zicheng Liu · 05 Apr 2021 · VLM
VisQA: X-raying Vision and Language Reasoning in Transformers · Theo Jaunet, Corentin Kervadec, Romain Vuillemot, G. Antipov, M. Baccouche, Christian Wolf · 02 Apr 2021