2210.01936
When and why vision-language models behave like bags-of-words, and what to do about it?
4 October 2022
Mert Yuksekgonul
Federico Bianchi
Pratyusha Kalluri
Dan Jurafsky
James Y. Zou
VLM
CoGe
Papers citing
"When and why vision-language models behave like bags-of-words, and what to do about it?"
50 / 285 papers shown
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
30
20
0
04 Apr 2024
Is CLIP the main roadblock for fine-grained open-world perception?
Lorenzo Bianchi
F. Carrara
Nicola Messina
Fabrizio Falchi
VLM
30
4
0
04 Apr 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLM
CoGe
41
9
0
02 Apr 2024
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Zhiqiu Lin
Deepak Pathak
Baiqi Li
Jiayao Li
Xide Xia
Graham Neubig
Pengchuan Zhang
Deva Ramanan
EGVM
37
127
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
18
2
0
01 Apr 2024
Do Vision-Language Models Understand Compound Nouns?
Sonal Kumar
Sreyan Ghosh
S. Sakshi
Utkarsh Tyagi
Dinesh Manocha
CLIP
CoGe
VLM
64
0
0
30 Mar 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh
Ishaan Shrivastava
Mayank Vatsa
Richa Singh
Aparna Bharati
VLM
CoGe
24
14
0
29 Mar 2024
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He
Alexander Robey
Naoki Murata
Yiding Jiang
J. Williams
George Pappas
Hamed Hassani
Yuki Mitsufuji
Ruslan Salakhutdinov
J. Zico Kolter
DiffM
91
4
0
28 Mar 2024
ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition
Samuel Li
Sarthak Bhagat
Joseph Campbell
Yaqi Xie
Woojun Kim
Katia P. Sycara
Simon Stepputtis
LM&Ro
40
11
0
26 Mar 2024
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Oscar Manas
Pietro Astolfi
Melissa Hall
Candace Ross
Jack Urbanek
Adina Williams
Aishwarya Agrawal
Adriana Romero Soriano
M. Drozdzal
29
27
0
26 Mar 2024
DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng
Yifei Zhang
Wei Wu
Fan Lu
Shuailei Ma
Xin Jin
Wei Chen
Yujun Shen
VLM
CLIP
32
24
0
25 Mar 2024
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Reza Esfandiarpoor
Cristina Menghini
Stephen H. Bach
CoGe
VLM
27
8
0
25 Mar 2024
Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
Yingshan Chang
Yasi Zhang
Zhiyuan Fang
Yingnian Wu
Yonatan Bisk
Feng Gao
EGVM
34
6
0
25 Mar 2024
Explore until Confident: Efficient Exploration for Embodied Question Answering
Allen Z. Ren
Jaden Clark
Anushri Dixit
Masha Itkina
Anirudha Majumdar
Dorsa Sadigh
40
28
0
23 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith Ngai
32
4
0
21 Mar 2024
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke
Zhixi Cai
Simindokht Jahangard
Weiqing Wang
P. D. Haghighi
Hamid Rezatofighi
LRM
38
9
0
19 Mar 2024
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
Yash Bhalgat
Iro Laina
João F. Henriques
Andrew Zisserman
Andrea Vedaldi
41
14
0
16 Mar 2024
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation
Joseph Cho
Fachrina Dewi Puspitasari
Sheng Zheng
Jingyao Zheng
Lik-Hang Lee
Tae-Ho Kim
Choong Seon Hong
Chaoning Zhang
EGVM
VGen
36
40
0
08 Mar 2024
Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
Philipp J. Rösch
Norbert Oswald
Michaela Geierhos
Jindrich Libovický
28
3
0
05 Mar 2024
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi-An Ma
Kamalika Chaudhuri
Chuan Guo
48
3
0
04 Mar 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
38
3
0
27 Feb 2024
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim
Seunghyun Yoon
Trung Bui
Handong Zhao
Quan Tran
Franck Dernoncourt
Jaewoo Kang
CLIP
19
2
0
23 Feb 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Rada Mihalcea
MLLM
CoGe
VLM
28
5
0
22 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
32
18
0
20 Feb 2024
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
Sebastian Koch
Narunas Vaskevicius
Mirco Colosi
Pedro Hermosilla
Timo Ropinski
3DPC
28
25
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
22
2
0
18 Feb 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla
Alexander X. Oesterling
Suraj Srinivas
Flavio du Pin Calmon
Himabindu Lakkaraju
34
35
0
16 Feb 2024
Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding
Alessandro Achille
Greg Ver Steeg
Tian Yu Liu
Matthew Trager
Carson Klingenberg
Stefano Soatto
25
1
0
14 Feb 2024
Pixel Sentence Representation Learning
Chenghao Xiao
Zhuoxu Huang
Danlu Chen
G. Hudson
Yizhi Li
Haoran Duan
Chenghua Lin
Jie Fu
Jungong Han
Noura Al Moubayed
SSL
4
2
0
13 Feb 2024
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
Jirayu Burapacheep
Ishan Gaur
Agam Bhatia
Tristan Thrush
19
4
0
07 Feb 2024
MouSi: Poly-Visual-Expert Vision-Language Models
Xiaoran Fan
Tao Ji
Changhao Jiang
Shuo Li
Senjie Jin
...
Qi Zhang
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yunchun Jiang
VLM
24
16
0
30 Jan 2024
A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming
Pengyuan Zhou
Lin Wang
Zhi Liu
Yanbin Hao
Pan Hui
Sasu Tarkoma
J. Kangasharju
VGen
34
26
0
30 Jan 2024
FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
S. DarshanSingh
Zeeshan Khan
Makarand Tapaswi
VLM
CLIP
26
3
0
15 Jan 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM
MLLM
27
283
0
11 Jan 2024
I am a Strange Dataset: Metalinguistic Tests for Language Models
Tristan Thrush
Jared Moore
Miguel Monares
Christopher Potts
Douwe Kiela
14
5
0
10 Jan 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu
Shan Ning
Xuming He
VLM
33
3
0
04 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
23
9
0
03 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD
VLM
32
5
0
29 Dec 2023
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
49
4
0
28 Dec 2023
Parrot Captions Teach CLIP to Spot Text
Yiqi Lin
Conghui He
Alex Jinpeng Wang
Bin Wang
Weijia Li
Mike Zheng Shou
20
7
0
21 Dec 2023
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Zhecheng Wang
R. Prabha
Tianyuan Huang
Jiajun Wu
Ram Rajagopal
29
53
0
20 Dec 2023
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Adriana Romero Soriano
CLIP
3DV
28
41
0
14 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
13
3
0
11 Dec 2023
Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
Ivan Kapelyukh
Yifei Ren
Ignacio Alzugaray
Edward Johns
VLM
LM&Ro
20
20
0
07 Dec 2023
FoMo Rewards: Can we cast foundation models as reward functions?
Ekdeep Singh Lubana
Johann Brehmer
P. D. Haan
Taco S. Cohen
OffRL
LRM
33
2
0
06 Dec 2023
A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics
Xiangru Zhu
Penglei Sun
Chengyu Wang
Jingping Liu
Zhixu Li
Yanghua Xiao
Jun Huang
CoGe
100
5
0
04 Dec 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIP
VLM
24
43
0
28 Nov 2023
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han
Fangrui Zhu
Qianru Lao
Huaizu Jiang
ObjD
16
11
0
28 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM
LRM
26
80
0
27 Nov 2023
Benchmarking Robustness of Text-Image Composed Retrieval
Shitong Sun
Jindong Gu
Shaogang Gong
CoGe
31
1
0
24 Nov 2023