ResearchTrend.AI
When and why vision-language models behave like bags-of-words, and what to do about it?

4 October 2022
Mert Yuksekgonul
Federico Bianchi
Pratyusha Kalluri
Dan Jurafsky
James Y. Zou
    VLM
    CoGe

Papers citing "When and why vision-language models behave like bags-of-words, and what to do about it?"

50 / 285 papers shown
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
30
20
0
04 Apr 2024
Is CLIP the main roadblock for fine-grained open-world perception?
Lorenzo Bianchi
F. Carrara
Nicola Messina
Fabrizio Falchi
VLM
30
4
0
04 Apr 2024
Iterated Learning Improves Compositionality in Large Vision-Language Models
Chenhao Zheng
Jieyu Zhang
Aniruddha Kembhavi
Ranjay Krishna
VLM
CoGe
41
9
0
02 Apr 2024
Evaluating Text-to-Visual Generation with Image-to-Text Generation
Zhiqiu Lin
Deepak Pathak
Baiqi Li
Jiayao Li
Xide Xia
Graham Neubig
Pengchuan Zhang
Deva Ramanan
EGVM
37
127
0
01 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
18
2
0
01 Apr 2024
Do Vision-Language Models Understand Compound Nouns?
Sonal Kumar
Sreyan Ghosh
S. Sakshi
Utkarsh Tyagi
Dinesh Manocha
CLIP
CoGe
VLM
64
0
0
30 Mar 2024
Learn "No" to Say "Yes" Better: Improving Vision-Language Models via Negations
Jaisidh Singh
Ishaan Shrivastava
Mayank Vatsa
Richa Singh
Aparna Bharati
VLM
CoGe
24
14
0
29 Mar 2024
Automated Black-box Prompt Engineering for Personalized Text-to-Image Generation
Yutong He
Alexander Robey
Naoki Murata
Yiding Jiang
J. Williams
George Pappas
Hamed Hassani
Yuki Mitsufuji
Ruslan Salakhutdinov
J. Zico Kolter
DiffM
91
4
0
28 Mar 2024
ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition
Samuel Li
Sarthak Bhagat
Joseph Campbell
Yaqi Xie
Woojun Kim
Katia P. Sycara
Simon Stepputtis
LM&Ro
40
11
0
26 Mar 2024
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Oscar Manas
Pietro Astolfi
Melissa Hall
Candace Ross
Jack Urbanek
Adina Williams
Aishwarya Agrawal
Adriana Romero Soriano
M. Drozdzal
29
27
0
26 Mar 2024
DreamLIP: Language-Image Pre-training with Long Captions
Kecheng Zheng
Yifei Zhang
Wei Wu
Fan Lu
Shuailei Ma
Xin Jin
Wei Chen
Yujun Shen
VLM
CLIP
32
24
0
25 Mar 2024
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Reza Esfandiarpoor
Cristina Menghini
Stephen H. Bach
CoGe
VLM
27
8
0
25 Mar 2024
Skews in the Phenomenon Space Hinder Generalization in Text-to-Image Generation
Yingshan Chang
Yasi Zhang
Zhiyuan Fang
Yingnian Wu
Yonatan Bisk
Feng Gao
EGVM
34
6
0
25 Mar 2024
Explore until Confident: Efficient Exploration for Embodied Question Answering
Allen Z. Ren
Jaden Clark
Anushri Dixit
Masha Itkina
Anirudha Majumdar
Dorsa Sadigh
40
28
0
23 Mar 2024
Can 3D Vision-Language Models Truly Understand Natural Language?
Weipeng Deng
Jihan Yang
Runyu Ding
Jiahui Liu
Yijiang Li
Xiaojuan Qi
Edith Ngai
32
4
0
21 Mar 2024
HYDRA: A Hyper Agent for Dynamic Compositional Visual Reasoning
Fucai Ke
Zhixi Cai
Simindokht Jahangard
Weiqing Wang
P. D. Haghighi
Hamid Rezatofighi
LRM
38
9
0
19 Mar 2024
N2F2: Hierarchical Scene Understanding with Nested Neural Feature Fields
Yash Bhalgat
Iro Laina
João F. Henriques
Andrew Zisserman
Andrea Vedaldi
41
14
0
16 Mar 2024
Sora as an AGI World Model? A Complete Survey on Text-to-Video Generation
Joseph Cho
Fachrina Dewi Puspitasari
Sheng Zheng
Jingyao Zheng
Lik-Hang Lee
Tae-Ho Kim
Choong Seon Hong
Chaoning Zhang
EGVM
VGen
36
40
0
08 Mar 2024
Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
Philipp J. Rösch
Norbert Oswald
Michaela Geierhos
Jindrich Libovický
28
3
0
05 Mar 2024
Differentially Private Representation Learning via Image Captioning
Tom Sander
Yaodong Yu
Maziar Sanjabi
Alain Durmus
Yi-An Ma
Kamalika Chaudhuri
Chuan Guo
48
3
0
04 Mar 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
38
3
0
27 Feb 2024
Fine-tuning CLIP Text Encoders with Two-step Paraphrasing
Hyunjae Kim
Seunghyun Yoon
Trung Bui
Handong Zhao
Quan Tran
Franck Dernoncourt
Jaewoo Kang
CLIP
19
2
0
23 Feb 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Rada Mihalcea
MLLM
CoGe
VLM
28
5
0
22 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
32
18
0
20 Feb 2024
Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships
Sebastian Koch
Narunas Vaskevicius
Mirco Colosi
Pedro Hermosilla
Timo Ropinski
3DPC
28
25
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
22
2
0
18 Feb 2024
Interpreting CLIP with Sparse Linear Concept Embeddings (SpLiCE)
Usha Bhalla
Alexander X. Oesterling
Suraj Srinivas
Flavio du Pin Calmon
Himabindu Lakkaraju
34
35
0
16 Feb 2024
Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding
Alessandro Achille
Greg Ver Steeg
Tian Yu Liu
Matthew Trager
Carson Klingenberg
Stefano Soatto
25
1
0
14 Feb 2024
Pixel Sentence Representation Learning
Chenghao Xiao
Zhuoxu Huang
Danlu Chen
G. Hudson
Yizhi Li
Haoran Duan
Chenghua Lin
Jie Fu
Jungong Han
Noura Al Moubayed
SSL
4
2
0
13 Feb 2024
ColorSwap: A Color and Word Order Dataset for Multimodal Evaluation
Jirayu Burapacheep
Ishan Gaur
Agam Bhatia
Tristan Thrush
19
4
0
07 Feb 2024
MouSi: Poly-Visual-Expert Vision-Language Models
Xiaoran Fan
Tao Ji
Changhao Jiang
Shuo Li
Senjie Jin
...
Qi Zhang
Xipeng Qiu
Xuanjing Huang
Zuxuan Wu
Yunchun Jiang
VLM
24
16
0
30 Jan 2024
A Survey on Generative AI and LLM for Video Generation, Understanding, and Streaming
Pengyuan Zhou
Lin Wang
Zhi Liu
Yanbin Hao
Pan Hui
Sasu Tarkoma
J. Kangasharju
VGen
34
26
0
30 Jan 2024
FiGCLIP: Fine-Grained CLIP Adaptation via Densely Annotated Videos
S. DarshanSingh
Zeeshan Khan
Makarand Tapaswi
VLM
CLIP
26
3
0
15 Jan 2024
Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs
Shengbang Tong
Zhuang Liu
Yuexiang Zhai
Yi-An Ma
Yann LeCun
Saining Xie
VLM
MLLM
27
283
0
11 Jan 2024
I am a Strange Dataset: Metalinguistic Tests for Language Models
Tristan Thrush
Jared Moore
Miguel Monares
Christopher Potts
Douwe Kiela
14
5
0
10 Jan 2024
Mining Fine-Grained Image-Text Alignment for Zero-Shot Captioning via Text-Only Training
Longtian Qiu
Shan Ning
Xuming He
VLM
33
3
0
04 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
23
9
0
03 Jan 2024
Generating Enhanced Negatives for Training Language-Based Object Detectors
Shiyu Zhao
Long Zhao
Vijay Kumar B.G
Yumin Suh
Dimitris N. Metaxas
Manmohan Chandraker
S. Schulter
ObjD
VLM
32
5
0
29 Dec 2023
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
49
4
0
28 Dec 2023
Parrot Captions Teach CLIP to Spot Text
Yiqi Lin
Conghui He
Alex Jinpeng Wang
Bin Wang
Weijia Li
Mike Zheng Shou
20
7
0
21 Dec 2023
SkyScript: A Large and Semantically Diverse Vision-Language Dataset for Remote Sensing
Zhecheng Wang
R. Prabha
Tianyuan Huang
Jiajun Wu
Ram Rajagopal
29
53
0
20 Dec 2023
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Adriana Romero Soriano
CLIP
3DV
28
41
0
14 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
13
3
0
11 Dec 2023
Dream2Real: Zero-Shot 3D Object Rearrangement with Vision-Language Models
Ivan Kapelyukh
Yifei Ren
Ignacio Alzugaray
Edward Johns
VLM
LM&Ro
20
20
0
07 Dec 2023
FoMo Rewards: Can we cast foundation models as reward functions?
Ekdeep Singh Lubana
Johann Brehmer
P. D. Haan
Taco S. Cohen
OffRL
LRM
33
2
0
06 Dec 2023
A Contrastive Compositional Benchmark for Text-to-Image Synthesis: A Study with Unified Text-to-Image Fidelity Metrics
Xiangru Zhu
Penglei Sun
Chengyu Wang
Jingping Liu
Zhixu Li
Yanghua Xiao
Jun Huang
CoGe
100
5
0
04 Dec 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIP
VLM
24
43
0
28 Nov 2023
Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions
Zeyu Han
Fangrui Zhu
Qianru Lao
Huaizu Jiang
ObjD
16
11
0
28 Nov 2023
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Chancharik Mitra
Brandon Huang
Trevor Darrell
Roei Herzig
MLLM
LRM
26
80
0
27 Nov 2023
Benchmarking Robustness of Text-Image Composed Retrieval
Shitong Sun
Jindong Gu
Shaogang Gong
CoGe
31
1
0
24 Nov 2023