2005.07310
Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models
15 May 2020
Jize Cao
Zhe Gan
Yu Cheng
Licheng Yu
Yen-Chun Chen
Jingjing Liu
VLM
Papers citing
"Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models"
50 / 90 papers shown
Diagnosing Bottlenecks in Data Visualization Understanding by Vision-Language Models
Alexa R. Tartaglini
Satchel Grant
Daniel Wurgaft
Christopher Potts
Judith E. Fan
125
0
0
02 Oct 2025
REMA: A Unified Reasoning Manifold Framework for Interpreting Large Language Model
Bo Li
Guanzhi Deng
Ronghao Chen
Junrong Yue
Shuo Zhang
Qinghua Zhao
Linqi Song
Lijie Wen
LRM
120
1
0
26 Sep 2025
An Empirical Study on How Video-LLMs Answer Video Questions
Chenhui Gou
Ziyu Ma
Zicheng Duan
Haoyu He
Feng Chen
Akide Liu
Bohan Zhuang
Jianfei Cai
H. Rezatofighi
156
1
0
21 Aug 2025
Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs
Yaniv Nikankin
Dana Arad
Yossi Gandelsman
Yonatan Belinkov
324
7
0
10 Jun 2025
FRAMES-VQA: Benchmarking Fine-Tuning Robustness across Multi-Modal Shifts in Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2025
Chengyue Huang
Brisa Maneechotesuwan
Shivang Chopra
Z. Kira
AAML
290
4
0
27 May 2025
Vision-Language-Action (VLA) Models: Concepts, Progress, Applications and Challenges
Ranjan Sapkota
Yang Cao
Konstantinos I. Roumeliotis
Manoj Karkee
LM&Ro
1.0K
44
0
07 May 2025
TerraMind: Large-Scale Generative Multimodality for Earth Observation
Johannes Jakubik
Felix Yang
Benedikt Blumenstiel
Erik Scheurer
Rocco Sedona
...
P. Fraccaro
Thomas Brunschwiler
Gabriele Cavallaro
Juan Bernabé-Moreno
Alessandra Feliciotti
MLLM
VLM
497
47
0
15 Apr 2025
COSMO: Combination of Selective Memorization for Low-cost Vision-and-Language Navigation
Siqi Zhang
Yanyuan Qiao
Qunbo Wang
Zike Yan
Qi Wu
Zhihua Wei
Qingbin Liu
534
3
0
31 Mar 2025
A Review of Multimodal Explainable Artificial Intelligence: Past, Present and Future
Shilin Sun
Wenbin An
Feng Tian
Fang Nan
Qidong Liu
Jing Liu
N. Shah
Ping Chen
389
20
0
18 Dec 2024
Explainable and Interpretable Multimodal Large Language Models: A Comprehensive Survey
Yunkai Dang
Kaichen Huang
Jiahao Huo
Yibo Yan
Shijie Huang
...
Kun Wang
Yong Liu
Jing Shao
Hui Xiong
Xuming Hu
LRM
434
56
0
03 Dec 2024
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Computer Vision and Pattern Recognition (CVPR), 2024
Zhangqi Jiang
Junkai Chen
Beier Zhu
Tingjin Luo
Yankun Shen
Xu Yang
532
58
0
23 Nov 2024
Quantifying and Enabling the Interpretability of CLIP-like Models
Avinash Madasu
Yossi Gandelsman
Vasudev Lal
Phillip Howard
VLM
231
3
0
10 Sep 2024
Multi-Object Hallucination in Vision-Language Models
Xuweiyi Chen
Ziqiao Ma
Xuejun Zhang
Sihan Xu
Shengyi Qian
Jianing Yang
David Fouhey
Joyce Chai
313
44
0
08 Jul 2024
Solution for SMART-101 Challenge of CVPR Multi-modal Algorithmic Reasoning Task 2024
Jinwoo Ahn
Junhyeok Park
Min-Jun Kim
Kang-Hyeon Kim
So-Yeong Sohn
Yun-Ji Lee
Du-Seong Chang
Yu-Jung Heo
Eun-Sol Kim
LRM
176
0
0
10 Jun 2024
Interpretable Tensor Fusion
Saurabh Varshneya
Antoine Ledent
Philipp Liznerski
Andriy Balinskyy
Purvanshi Mehta
Waleed Mustafa
Matthias Kirchler
171
3
0
07 May 2024
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
335
34
0
20 Apr 2024
INSPECT: Intrinsic and Systematic Probing Evaluation for Code Transformers
IEEE Transactions on Software Engineering (TSE), 2023
Anjan Karmakar
Romain Robbes
233
6
0
08 Dec 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
392
1
0
28 Nov 2023
Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain
Dota Tianai Dong
Mariya Toneva
189
8
0
13 Nov 2023
Dataset Bias Mitigation in Multiple-Choice Visual Question Answering and Beyond
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Zhecan Wang
Long Chen
Haoxuan You
Keyang Xu
Yicheng He
Wenhao Li
Noel Codella
Kai-Wei Chang
Shih-Fu Chang
350
7
0
23 Oct 2023
VDialogUE: A Unified Evaluation Benchmark for Visually-grounded Dialogue
Yunshui Li
Binyuan Hui
Zhaochao Yin
Wanwei He
Run Luo
Yuxing Long
Min Yang
Fei Huang
Yongbin Li
168
1
0
14 Sep 2023
Towards Vision-Language Mechanistic Interpretability: A Causal Tracing Tool for BLIP
Vedant Palit
Rohan Pandey
Aryaman Arora
Paul Pu Liang
275
46
0
27 Aug 2023
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
Haiwei Yang
Liang Ding
Jun Rao
Ye Liu
Li Shen
Changxing Ding
241
24
0
24 Aug 2023
Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions
IEEE Access, 2023
N. Rodis
Christos Sardianos
Panagiotis I. Radoglou-Grammatikis
Panagiotis G. Sarigiannidis
Iraklis Varlamis
Georgios Th. Papadopoulos
337
42
0
09 Jun 2023
Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Hidetaka Kamigaito
Katsuhiko Hayashi
Taro Watanabe
VLM
180
1
0
03 Jun 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Emanuele Bugliarello
Laurent Sartran
Aishwarya Agrawal
Lisa Anne Hendricks
Aida Nematzadeh
VLM
235
31
0
12 May 2023
Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition
Neural Information Processing Systems (NeurIPS), 2023
Shuhuai Ren
Aston Zhang
Yi Zhu
Shuai Zhang
Shuai Zheng
Mu Li
Alexander J. Smola
Xu Sun
VPVLM
VLM
239
41
0
10 Apr 2023
How Does Attention Work in Vision Transformers? A Visual Analytics Attempt
IEEE Transactions on Visualization and Computer Graphics (TVCG), 2023
Yiran Li
Junpeng Wang
Xin Dai
Liang Wang
Chin-Chia Michael Yeh
Yan Zheng
Wei Zhang
Kwan-Liu Ma
ViT
136
44
0
24 Mar 2023
Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding
Computer Vision and Pattern Recognition (CVPR), 2023
Morris Alper
Michael Fiman
Hadar Averbuch-Elor
VLM
LRM
244
17
0
21 Mar 2023
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
Maria Lymperaiou
Giorgos Stamou
VLM
238
5
0
04 Mar 2023
Controlling for Stereotypes in Multimodal Language Model Evaluation
BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP (BlackboxNLP), 2023
Manuj Malik
Richard Johansson
284
1
0
03 Feb 2023
MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Letitia Parcalabescu
Anette Frank
236
49
0
15 Dec 2022
A survey on knowledge-enhanced multimodal learning
Artificial Intelligence Review (Artif Intell Rev), 2022
Maria Lymperaiou
Giorgos Stamou
477
23
0
19 Nov 2022
Understanding ME? Multimodal Evaluation for Fine-grained Visual Commonsense
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Zhecan Wang
Haoxuan You
Yicheng He
Wenhao Li
Kai-Wei Chang
Shih-Fu Chang
278
6
0
10 Nov 2022
Understanding Cross-modal Interactions in V&L Models that Generate Scene Descriptions
Michele Cafagna
Kees van Deemter
Albert Gatt
CoGe
173
4
0
09 Nov 2022
DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention
ACM Transactions on Knowledge Discovery from Data (TKDD), 2021
Fenglin Liu
Xian Wu
Shen Ge
Xuancheng Ren
Wei Fan
Xu Sun
Yuexian Zou
VLM
209
13
0
28 Oct 2022
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Mitja Nikolaus
Emmanuelle Salin
Stéphane Ayache
Abdellah Fourtassi
Benoit Favre
155
17
0
21 Oct 2022
Probing Cross-modal Semantics Alignment Capability from the Textual Perspective
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022
Zheng Ma
Shi Zong
Mianzhi Pan
Jianbing Zhang
Shujian Huang
Xinyu Dai
Jiajun Chen
182
5
0
18 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
211
64
0
14 Oct 2022
One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks
Workshop on Representation Learning for NLP (RepL4NLP), 2022
Gregor Geigle
Chen Cecilia Liu
Jonas Pfeiffer
Iryna Gurevych
VLM
188
1
0
12 Oct 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
ACM Computing Surveys (ACM CSUR), 2022
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
315
174
0
07 Sep 2022
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
European Conference on Computer Vision (ECCV), 2022
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
225
56
0
26 Jul 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
Anh Tuan Luu
VLM
CLIP
282
2
0
05 Jul 2022
VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix
International Conference on Machine Learning (ICML), 2022
Teng Wang
Wenhao Jiang
Zhichao Lu
Feng Zheng
Ran Cheng
Chengguo Yin
Ping Luo
VLM
209
54
0
17 Jun 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
574
860
0
13 Jun 2022
Delving into the Openness of CLIP
Annual Meeting of the Association for Computational Linguistics (ACL), 2022
Shuhuai Ren
Lei Li
Xuancheng Ren
Guangxiang Zhao
Xu Sun
VLM
241
15
0
04 Jun 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
308
14
0
30 May 2022
Visualizing and Explaining Language Models
Adrian M. P. Braşoveanu
Razvan Andonie
MILM
VLM
330
7
0
30 Apr 2022
Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly
European Conference on Computer Vision (ECCV), 2022
Spencer Whitehead
Suzanne Petryk
Vedaad Shakib
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
387
76
0
28 Apr 2022
Multimodal Adaptive Distillation for Leveraging Unimodal Encoders for Vision-Language Tasks
Zhecan Wang
Noel Codella
Yen-Chun Chen
Luowei Zhou
Xiyang Dai
...
Jianwei Yang
Haoxuan You
Kai-Wei Chang
Shih-Fu Chang
Lu Yuan
VLM
OffRL
260
27
0
22 Apr 2022