ResearchTrend.AI
VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations (arXiv:2207.00221)

1 July 2022
Tiancheng Zhao
Tianqi Zhang
Mingwei Zhu
Haozhan Shen
Kyusong Lee
Xiaopeng Lu
Jianwei Yin
    VLM
    CoGe
    MLLM

Papers citing "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations"

50 / 69 papers shown
ColorBench: Can VLMs See and Understand the Colorful World? A Comprehensive Benchmark for Color Perception, Reasoning, and Robustness
Yijun Liang
Ming Li
Chenrui Fan
Ziyue Li
Dang Nguyen
Kwesi Cobbina
Shweta Bhardwaj
Jiuhai Chen
Fuxiao Liu
Tianyi Zhou
VLM
CoGe
51
0
0
10 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
41
0
0
04 Apr 2025
Dynamic Relation Inference via Verb Embeddings
Omri Suissa
Muhiim Ali
Ariana Azarbal
Hui Shen
Shekhar Pradhan
46
0
0
17 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
61
0
0
10 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li
Boyang Li
CoGe
73
0
0
03 Mar 2025
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Reza Abbasi
Ali Nazari
Aminreza Sefid
Mohammadali Banayeeanzade
M. Rohban
M. Baghshah
VLM
86
1
0
27 Feb 2025
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
LRM
39
6
0
24 Feb 2025
Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel
Pietro Astolfi
Florian Bordes
M. Drozdzal
Adriana Romero Soriano
OCL
VLM
CoGe
103
0
0
19 Feb 2025
Real Classification by Description: Extending CLIP's Limits of Part Attributes Recognition
Ethan Baron
Idan Tankel
Peter Tu
Guy Ben-Yosef
VLM
84
0
0
18 Dec 2024
Beyond Coarse-Grained Matching in Video-Text Retrieval
Aozhu Chen
Hazel Doughty
Xirong Li
Cees G. M. Snoek
34
0
0
16 Oct 2024
TULIP: Token-length Upgraded CLIP
Ivona Najdenkoska
Mohammad Mahdi Derakhshani
Yuki M. Asano
Nanne van Noord
Marcel Worring
Cees G. M. Snoek
VLM
48
3
0
13 Oct 2024
Compositional Entailment Learning for Hyperbolic Vision-Language Models
Avik Pal
Max van Spengler
Guido Maria D'Amely di Melendugno
Alessandro Flaborea
Fabio Galasso
Pascal Mettes
CoGe
48
5
0
09 Oct 2024
Removing Distributional Discrepancies in Captions Improves Image-Text Alignment
Yuheng Li
Haotian Liu
Mu Cai
Yijun Li
Eli Shechtman
Zhe Lin
Yong Jae Lee
Krishna Kumar Singh
VLM
141
1
0
01 Oct 2024
Evaluating Attribute Comprehension in Large Vision-Language Models
Haiwen Zhang
Zixi Yang
Yuanzhi Liu
Xinran Wang
Zheqi He
Kongming Liang
Zhanyu Ma
ELM
29
0
0
25 Aug 2024
NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality
Chaofan Tao
Gukyeong Kwon
Varad Gunjal
Hao Yang
Zhaowei Cai
Yonatan Dukler
Ashwin Swaminathan
R. Manmatha
Colin Jon Taylor
Stefano Soatto
CoGe
37
0
0
18 Aug 2024
GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models
Ali Abdollahi
Mahdi Ghaznavi
Mohammad Reza Karimi Nejad
Arash Mari Oriyad
Reza Abbasi
Ali Salesi
Melika Behjati
M. Rohban
M. Baghshah
CoGe
32
1
0
30 Jul 2024
Weak-to-Strong Compositional Learning from Generative Models for Language-based Object Detection
Kwanyong Park
Kuniaki Saito
Donghyun Kim
VLM
CoGe
53
0
0
21 Jul 2024
SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
CoGe
40
9
0
17 Jun 2024
Exploring the Spectrum of Visio-Linguistic Compositionality and Recognition
Youngtaek Oh
Pyunghwan Ahn
Jinhyung Kim
Gwangmo Song
Soonyoung Lee
In So Kweon
Junmo Kim
CoGe
45
2
0
13 Jun 2024
VISLA Benchmark: Evaluating Embedding Sensitivity to Semantic and Lexical Alterations
Sri Harsha Dumpala
Aman Jaiswal
Chandramouli Shama Sastry
E. Milios
Sageev Oore
Hassan Sajjad
VLM
CoGe
45
0
0
25 Apr 2024
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani
Bac Nguyen
Samuel Lavoie
Ranjay Krishna
Aaron Courville
39
1
0
24 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
50
9
0
23 Apr 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLM
CoGe
54
9
0
02 Apr 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith C.H. Ngai
37
4
0
21 Mar 2024
Towards Multimodal In-Context Learning for Vision & Language Models
Sivan Doveh
Shaked Perek
M. Jehanzeb Mirza
Wei Lin
Amit Alfassy
Assaf Arbelle
S. Ullman
Leonid Karlinsky
VLM
114
14
0
19 Mar 2024
Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
Philipp J. Rösch
Norbert Oswald
Michaela Geierhos
Jindrich Libovický
42
3
0
05 Mar 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Rada Mihalcea
MLLM
CoGe
VLM
34
5
0
22 Feb 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
27
9
0
03 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
49
4
0
28 Dec 2023
GroundVLP: Harnessing Zero-shot Visual Grounding from Vision-Language Pre-training and Open-Vocabulary Object Detection
Haozhan Shen
Tiancheng Zhao
Mingwei Zhu
Jianwei Yin
VLM
ObjD
86
11
0
22 Dec 2023
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Adriana Romero Soriano
CLIP
3DV
28
41
0
14 Dec 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM
CoGe
MLLM
30
18
0
30 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM
LRM
36
80
0
27 Nov 2023
EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models
Sijie Cheng
Zhicheng Guo
Jingwen Wu
Kechen Fang
Peng Li
Huaping Liu
Yang Liu
EgoV
LRM
31
16
0
27 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
30
11
0
22 Nov 2023
SPOT! Revisiting Video-Language Models for Event Understanding
Gengyuan Zhang
Jinhe Bi
Jindong Gu
Yanyu Chen
Volker Tresp
24
2
0
21 Nov 2023
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
U. Sahin
Hang Li
Qadeer Ahmad Khan
Daniel Cremers
Volker Tresp
VLM
CoGe
28
12
0
07 Nov 2023
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRM
CoGe
13
96
0
30 Oct 2023
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
37
2
0
24 Oct 2023
Benchmarking Sequential Visual Input Reasoning and Prediction in Multimodal Large Language Models
Mingwei Zhu
Leigang Sha
Yu Shu
Kangjia Zhao
Tiancheng Zhao
Jianwei Yin
LRM
27
0
0
20 Oct 2023
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models
Vishaal Udandarao
Max F. Burg
Samuel Albanie
Matthias Bethge
VLM
34
9
0
12 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM
MLLM
40
20
0
01 Oct 2023
How to Evaluate the Generalization of Detection? A Benchmark for Comprehensive Open-Vocabulary Detection
Yi Yao
Peng Liu
Tiancheng Zhao
Qianqian Zhang
Jiajia Liao
Chunxin Fang
Kyusong Lee
Qing Wang
VLM
ObjD
29
12
0
25 Aug 2023
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
Fei-Yue Wang
Liang Ding
Jun Rao
Ye Liu
Li Shen
Changxing Ding
32
15
0
24 Aug 2023
An Examination of the Compositionality of Large Generative Vision-Language Models
Teli Ma
Rong Li
Junwei Liang
CoGe
34
2
0
21 Aug 2023
Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models
Zheng Ma
Mianzhi Pan
Wenhan Wu
Ka Leong Cheng
Jianbing Zhang
Shujian Huang
Jiajun Chen
VLM
CoGe
23
3
0
06 Aug 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Cheng-Yu Hsieh
Jieyu Zhang
Zixian Ma
Aniruddha Kembhavi
Ranjay Krishna
CoGe
43
115
0
26 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
31
9
0
15 Jun 2023
Revisiting the Role of Language Priors in Vision-Language Models
Zhiqiu Lin
Xinyue Chen
Deepak Pathak
Pengchuan Zhang
Deva Ramanan
VLM
23
22
0
02 Jun 2023
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Roei Herzig
Donghyun Kim
...
Rameswar Panda
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
41
52
0
31 May 2023