When and why vision-language models behave like bags-of-words, and what to do about it?

4 October 2022

Mert Yuksekgonul

Dan Jurafsky

Papers citing "When and why vision-language models behave like bags-of-words, and what to do about it?"

35 / 285 papers shown

Title
Transferring Visual Attributes from Natural Language to Verified Image Generation Rodrigo Valerio João Bordalo Michal Yarom Yonattan Bitton Idan Szpektor João Magalhães 24 5 0 24 May 2023
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics Saba Ahmadi Aishwarya Agrawal 17 6 0 24 May 2023
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning Wentao Bao Lichang Chen Heng-Chiao Huang Yu Kong CoGe VLM 27 11 0 23 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining Emanuele Bugliarello Aida Nematzadeh Lisa Anne Hendricks SSL 22 5 0 23 May 2023
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation Yujie Lu Xianjun Yang Xiujun Li X. Wang William Yang Wang EGVM 44 35 0 18 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models Zhenhailong Wang Ansel Blume Sha Li Genglin Liu Jaemin Cho Zineng Tang Mohit Bansal Heng Ji KELM VGen 17 26 0 18 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation Michal Yarom Yonatan Bitton Soravit Changpinyo Roee Aharoni Jonathan Herzig Oran Lang E. Ofek Idan Szpektor EGVM 33 73 0 17 May 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding Emanuele Bugliarello Laurent Sartran Aishwarya Agrawal Lisa Anne Hendricks Aida Nematzadeh VLM 25 22 0 12 May 2023
An Inverse Scaling Law for CLIP Training Xianhang Li Zeyu Wang Cihang Xie VLM CLIP 35 54 0 11 May 2023
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs Roei Herzig Alon Mendelson Leonid Karlinsky Assaf Arbelle Rogerio Feris Trevor Darrell Amir Globerson VLM 23 31 0 10 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations Yufen Huang Jiji Tang Zhuo Chen Rongsheng Zhang Xinfeng Zhang ... Zeng Zhao Zhou Zhao Tangjie Lv Zhipeng Hu Wen Zhang VLM 12 21 0 06 May 2023
COLA: A Benchmark for Compositional Text-to-image Retrieval Arijit Ray Filip Radenovic Abhimanyu Dubey Bryan A. Plummer Ranjay Krishna Kate Saenko CoGe VLM 35 34 0 05 May 2023
TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis Mathis Petrovich Michael J. Black Gül Varol VGen 62 74 0 02 May 2023
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models Seulki Park Daeho Um Hajung Yoon Sanghyuk Chun Sangdoo Yun Jin Young Choi 25 2 0 21 Apr 2023
Verbs in Action: Improving verb understanding in video-language models Liliane Momeni Mathilde Caron Arsha Nagrani Andrew Zisserman Cordelia Schmid 30 70 0 13 Apr 2023
HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models Eslam Mohamed Bakr Pengzhan Sun Xiaoqian Shen Faizan Farooq Khan Li Erran Li Mohamed Elhoseiny VLM 11 76 0 11 Apr 2023
Probing Conceptual Understanding of Large Visual-Language Models Madeline Chantry Schiappa Raiyaan Abdullah Shehreen Azad Jared Claypoole Michael Cogswell Ajay Divakaran Y. S. Rawat 35 13 0 07 Apr 2023
The Vector Grounding Problem Dimitri Coelho Mollo Raphael Milliere 23 26 0 04 Apr 2023
Self-Supervised Multimodal Learning: A Survey Yongshuo Zong Oisin Mac Aodha Timothy M. Hospedales SSL 19 43 0 31 Mar 2023
Going Beyond Nouns With Vision & Language Models Using Synthetic Data Paola Cascante-Bonilla Khaled Shehada James Smith Sivan Doveh Donghyun Kim ... Gül Varol A. Oliva Vicente Ordonez Rogerio Feris Leonid Karlinsky VLM SyDa 22 40 0 30 Mar 2023
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection Kyle Buettner Adriana Kovashka 20 0 0 17 Mar 2023
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge Wei Lin Leonid Karlinsky Nina Shvetsova Horst Possegger Mateusz Koziñski Rameswar Panda Rogerio Feris Hilde Kuehne Horst Bischof VLM 100 38 0 15 Mar 2023
Computational Language Acquisition with Theory of Mind Andy Liu Hao Zhu Emmy Liu Yonatan Bisk Graham Neubig LLMAG AI4CE 16 18 0 02 Mar 2023
Aligning Bag of Regions for Open-Vocabulary Object Detection Size Wu Wenwei Zhang Sheng Jin Wentao Liu Chen Change Loy VLM ObjD 42 108 0 27 Feb 2023
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models Martha Lewis Nihal V. Nayak Peilin Yu Qinan Yu Jack Merullo Stephen H. Bach Ellie Pavlick VLM OCL CoGe 14 59 0 20 Dec 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality Anuj Diwan Layne Berry Eunsol Choi David F. Harwath Kyle Mahowald CoGe 101 41 0 01 Nov 2022
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models Huy Ha Shuran Song LM&Ro VLM 28 101 0 23 Jul 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Junnan Li Dongxu Li Caiming Xiong S. Hoi MLLM BDL VLM CLIP 390 4,110 0 28 Jan 2022
Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation Yao Qin Chiyuan Zhang Ting Chen Balaji Lakshminarayanan Alex Beutel Xuezhi Wang ViT 39 42 0 15 Oct 2021
COVR: A test-bed for Visually Grounded Compositional Generalization with real images Ben Bogin Shivanshu Gupta Matt Gardner Jonathan Berant CoGe 34 29 0 22 Sep 2021
Contrastive Language-Image Pre-training for the Italian Language Federico Bianchi Giuseppe Attanasio Raphael Pisoni Silvia Terragni Gabriele Sarti S. Lakshmi VLM CLIP 16 29 0 19 Aug 2021
Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers Kenny Peng Arunesh Mathur Arvind Narayanan 97 91 0 06 Aug 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision Chao Jia Yinfei Yang Ye Xia Yi-Ting Chen Zarana Parekh Hieu H. Pham Quoc V. Le Yun-hsuan Sung Zhen Li Tom Duerig VLM CLIP 293 3,683 0 11 Feb 2021
Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks? Thang M. Pham Trung Bui Long Mai Anh Totti Nguyen 207 122 0 30 Dec 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding Alex Jinpeng Wang Amanpreet Singh Julian Michael Felix Hill Omer Levy Samuel R. Bowman ELM 294 6,943 0 20 Apr 2018