Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2210.01936
Cited By
When and why vision-language models behave like bags-of-words, and what to do about it?
4 October 2022
Mert Yuksekgonul
Federico Bianchi
Pratyusha Kalluri
Dan Jurafsky
James Y. Zou
VLM
CoGe
Re-assign community
ArXiv
PDF
HTML
Papers citing
"When and why vision-language models behave like bags-of-words, and what to do about it?"
35 / 285 papers shown
Title
Transferring Visual Attributes from Natural Language to Verified Image Generation
Rodrigo Valerio
João Bordalo
Michal Yarom
Yonattan Bitton
Idan Szpektor
João Magalhães
24
5
0
24 May 2023
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics
Saba Ahmadi
Aishwarya Agrawal
17
6
0
24 May 2023
Prompting Language-Informed Distribution for Compositional Zero-Shot Learning
Wentao Bao
Lichang Chen
Heng-Chiao Huang
Yu Kong
CoGe
VLM
27
11
0
23 May 2023
Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining
Emanuele Bugliarello
Aida Nematzadeh
Lisa Anne Hendricks
SSL
22
5
0
23 May 2023
LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation
Yujie Lu
Xianjun Yang
Xiujun Li
X. Wang
William Yang Wang
EGVM
44
35
0
18 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Mohit Bansal
Heng Ji
KELM
VGen
17
26
0
18 May 2023
What You See is What You Read? Improving Text-Image Alignment Evaluation
Michal Yarom
Yonatan Bitton
Soravit Changpinyo
Roee Aharoni
Jonathan Herzig
Oran Lang
E. Ofek
Idan Szpektor
EGVM
33
73
0
17 May 2023
Measuring Progress in Fine-grained Vision-and-Language Understanding
Emanuele Bugliarello
Laurent Sartran
Aishwarya Agrawal
Lisa Anne Hendricks
Aida Nematzadeh
VLM
25
22
0
12 May 2023
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
35
54
0
11 May 2023
Incorporating Structured Representations into Pretrained Vision & Language Models Using Scene Graphs
Roei Herzig
Alon Mendelson
Leonid Karlinsky
Assaf Arbelle
Rogerio Feris
Trevor Darrell
Amir Globerson
VLM
23
31
0
10 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
12
21
0
06 May 2023
COLA: A Benchmark for Compositional Text-to-image Retrieval
Arijit Ray
Filip Radenovic
Abhimanyu Dubey
Bryan A. Plummer
Ranjay Krishna
Kate Saenko
CoGe
VLM
35
34
0
05 May 2023
TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis
Mathis Petrovich
Michael J. Black
Gül Varol
VGen
62
74
0
02 May 2023
RoCOCO: Robustness Benchmark of MS-COCO to Stress-test Image-Text Matching Models
Seulki Park
Daeho Um
Hajung Yoon
Sanghyuk Chun
Sangdoo Yun
Jin Young Choi
25
2
0
21 Apr 2023
Verbs in Action: Improving verb understanding in video-language models
Liliane Momeni
Mathilde Caron
Arsha Nagrani
Andrew Zisserman
Cordelia Schmid
30
70
0
13 Apr 2023
HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models
Eslam Mohamed Bakr
Pengzhan Sun
Xiaoqian Shen
Faizan Farooq Khan
Li Erran Li
Mohamed Elhoseiny
VLM
11
76
0
11 Apr 2023
Probing Conceptual Understanding of Large Visual-Language Models
Madeline Chantry Schiappa
Raiyaan Abdullah
Shehreen Azad
Jared Claypoole
Michael Cogswell
Ajay Divakaran
Y. S. Rawat
35
13
0
07 Apr 2023
The Vector Grounding Problem
Dimitri Coelho Mollo
Raphael Milliere
23
26
0
04 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
19
43
0
31 Mar 2023
Going Beyond Nouns With Vision & Language Models Using Synthetic Data
Paola Cascante-Bonilla
Khaled Shehada
James Smith
Sivan Doveh
Donghyun Kim
...
Gül Varol
A. Oliva
Vicente Ordonez
Rogerio Feris
Leonid Karlinsky
VLM
SyDa
22
40
0
30 Mar 2023
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
Kyle Buettner
Adriana Kovashka
20
0
0
17 Mar 2023
MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Wei Lin
Leonid Karlinsky
Nina Shvetsova
Horst Possegger
Mateusz Koziñski
Rameswar Panda
Rogerio Feris
Hilde Kuehne
Horst Bischof
VLM
100
38
0
15 Mar 2023
Computational Language Acquisition with Theory of Mind
Andy Liu
Hao Zhu
Emmy Liu
Yonatan Bisk
Graham Neubig
LLMAG
AI4CE
16
18
0
02 Mar 2023
Aligning Bag of Regions for Open-Vocabulary Object Detection
Size Wu
Wenwei Zhang
Sheng Jin
Wentao Liu
Chen Change Loy
VLM
ObjD
42
108
0
27 Feb 2023
Does CLIP Bind Concepts? Probing Compositionality in Large Image Models
Martha Lewis
Nihal V. Nayak
Peilin Yu
Qinan Yu
Jack Merullo
Stephen H. Bach
Ellie Pavlick
VLM
OCL
CoGe
14
59
0
20 Dec 2022
Why is Winoground Hard? Investigating Failures in Visuolinguistic Compositionality
Anuj Diwan
Layne Berry
Eunsol Choi
David F. Harwath
Kyle Mahowald
CoGe
101
41
0
01 Nov 2022
Semantic Abstraction: Open-World 3D Scene Understanding from 2D Vision-Language Models
Huy Ha
Shuran Song
LM&Ro
VLM
28
101
0
23 Jul 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
390
4,110
0
28 Jan 2022
Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation
Yao Qin
Chiyuan Zhang
Ting Chen
Balaji Lakshminarayanan
Alex Beutel
Xuezhi Wang
ViT
39
42
0
15 Oct 2021
COVR: A test-bed for Visually Grounded Compositional Generalization with real images
Ben Bogin
Shivanshu Gupta
Matt Gardner
Jonathan Berant
CoGe
34
29
0
22 Sep 2021
Contrastive Language-Image Pre-training for the Italian Language
Federico Bianchi
Giuseppe Attanasio
Raphael Pisoni
Silvia Terragni
Gabriele Sarti
S. Lakshmi
VLM
CLIP
16
29
0
19 Aug 2021
Mitigating Dataset Harms Requires Stewardship: Lessons from 1000 Papers
Kenny Peng
Arunesh Mathur
Arvind Narayanan
97
91
0
06 Aug 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
293
3,683
0
11 Feb 2021
Out of Order: How Important Is The Sequential Order of Words in a Sentence in Natural Language Understanding Tasks?
Thang M. Pham
Trung Bui
Long Mai
Anh Totti Nguyen
207
122
0
30 Dec 2020
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,943
0
20 Apr 2018
Previous
1
2
3
4
5
6