When and why vision-language models behave like bags-of-words, and what to do about it?

4 October 2022
Mert Yuksekgonul
Federico Bianchi
Pratyusha Kalluri
Dan Jurafsky
James Y. Zou
    VLM
    CoGe
arXiv: 2210.01936

Papers citing "When and why vision-language models behave like bags-of-words, and what to do about it?"

50 / 285 papers shown
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
28
11
0
22 Nov 2023
SPOT! Revisiting Video-Language Models for Event Understanding
Gengyuan Zhang
Jinhe Bi
Jindong Gu
Yanyu Chen
Volker Tresp
19
1
0
21 Nov 2023
What's left can't be right -- The remaining positional incompetence of contrastive vision-language models
Nils Hoehing
Ellen Rushe
Anthony Ventresque
VLM
13
2
0
20 Nov 2023
SelfEval: Leveraging the discriminative nature of generative models for evaluation
Sai Saketh Rambhatla
Ishan Misra
EGVM
25
4
0
17 Nov 2023
VideoCon: Robust Video-Language Alignment via Contrast Captions
Hritik Bansal
Yonatan Bitton
Idan Szpektor
Kai-Wei Chang
Aditya Grover
28
14
0
15 Nov 2023
Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
U. Sahin
Hang Li
Qadeer Ahmad Khan
Daniel Cremers
Volker Tresp
VLM
CoGe
23
12
0
07 Nov 2023
CoVLM: Composing Visual Entities and Relationships in Large Language Models Via Communicative Decoding
Junyan Li
Delin Chen
Yining Hong
Zhenfang Chen
Peihao Chen
Yikang Shen
Chuang Gan
MLLM
13
14
0
06 Nov 2023
What's "up" with vision-language models? Investigating their struggle with spatial reasoning
Amita Kamath
Jack Hessel
Kai-Wei Chang
LRM
CoGe
13
95
0
30 Oct 2023
EmoCLIP: A Vision-Language Method for Zero-Shot Video Facial Expression Recognition
Niki Maria Foteinopoulou
Ioannis Patras
VLM
19
16
0
25 Oct 2023
Lang3DSG: Language-based contrastive pre-training for 3D Scene Graph prediction
Sebastian Koch
Pedro Hermosilla
Narunas Vaskevicius
Mirco Colosi
Timo Ropinski
29
9
0
25 Oct 2023
Length is a Curse and a Blessing for Document-level Semantics
Chenghao Xiao
Yizhi Li
G. Thomas Hudson
Chenghua Lin
Noura Al Moubayed
22
6
0
24 Oct 2023
CompA: Addressing the Gap in Compositional Reasoning in Audio-Language Models
Sreyan Ghosh
Ashish Seth
Sonal Kumar
Utkarsh Tyagi
Chandra Kiran Reddy Evuru
S. Ramaneswaran
S. Sakshi
Oriol Nieto
R. Duraiswami
Dinesh Manocha
AuLLM
VLM
CoGe
35
21
0
12 Oct 2023
Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models
Vishaal Udandarao
Max F. Burg
Samuel Albanie
Matthias Bethge
VLM
26
9
0
12 Oct 2023
The Role of Linguistic Priors in Measuring Compositional Generalization of Vision-Language Models
Chenwei Wu
Erran L. Li
Stefano Ermon
Patrick Haffner
Rong Ge
Zaiwei Zhang
VLM
CoGe
24
0
0
04 Oct 2023
Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models
Kota Sueyoshi
Takashi Matsubara
DiffM
8
8
0
03 Oct 2023
Towards reporting bias in visual-language datasets: bimodal augmentation by decoupling object-attribute association
Qiyu Wu
Mengjie Zhao
Yutong He
Lang Huang
Junya Ono
Hiromi Wakaki
Yuki Mitsufuji
12
4
0
02 Oct 2023
CLIP-DIY: CLIP Dense Inference Yields Open-Vocabulary Semantic Segmentation For-Free
Monika Wysoczańska
Michael Ramamonjisoa
Tomasz Trzciński
Oriane Siméoni
3DV
VLM
19
20
0
25 Sep 2023
LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
Jianing Yang
Xuweiyi Chen
Shengyi Qian
Nikhil Madaan
Madhavan Iyengar
David Fouhey
Joyce Chai
LM&Ro
LLMAG
22
84
0
21 Sep 2023
Looking at words and points with attention: a benchmark for text-to-shape coherence
Andrea Amaduzzi
Giuseppe Lisanti
Samuele Salti
Luigi Di Stefano
16
2
0
14 Sep 2023
Compositional Learning of Visually-Grounded Concepts Using Reinforcement
Zijun Lin
Haidi Azaman
M Ganesh Kumar
Cheston Tan
CoGe
OffRL
17
3
0
08 Sep 2023
A Survey of Diffusion Based Image Generation Models: Issues and Their Solutions
Tianyi Zhang
Zheng Wang
Jin Huang
M. M. Tasnim
Wei Shi
VLM
11
21
0
25 Aug 2023
Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?
Fei-Yue Wang
Liang Ding
Jun Rao
Ye Liu
Li Shen
Changxing Ding
27
15
0
24 Aug 2023
An Examination of the Compositionality of Large Generative Vision-Language Models
Teli Ma
Rong Li
Junwei Liang
CoGe
24
2
0
21 Aug 2023
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
Florian Bordes
Shashank Shekhar
Mark Ibrahim
Diane Bouchacourt
Pascal Vincent
Ari S. Morcos
23
25
0
08 Aug 2023
Food-500 Cap: A Fine-Grained Food Caption Benchmark for Evaluating Vision-Language Models
Zheng Ma
Mianzhi Pan
Wenhan Wu
Ka Leong Cheng
Jianbing Zhang
Shujian Huang
Jiajun Chen
VLM
CoGe
18
3
0
06 Aug 2023
PerceptionCLIP: Visual Classification by Inferring and Conditioning on Contexts
Bang An
Sicheng Zhu
Michael-Andrei Panaitescu-Liess
Chaithanya Kumar Mummadi
Furong Huang
VLM
20
7
0
02 Aug 2023
Cross-Modal Concept Learning and Inference for Vision-Language Models
Yi Zhang
Ce Zhang
Yushun Tang
Z. He
VLM
MLLM
CLIP
23
15
0
28 Jul 2023
Distilled Feature Fields Enable Few-Shot Language-Guided Manipulation
Bokui (William) Shen
Ge Yang
Alan Yu
J. Wong
L. Kaelbling
Phillip Isola
VLM
16
102
0
27 Jul 2023
Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP
S. Basu
S. Hu
Maziar Sanjabi
Daniela Massiceti
S. Feizi
VLM
11
3
0
18 Jul 2023
A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future
Chaoyang Zhu
Long Chen
ObjD
VLM
24
32
0
18 Jul 2023
TIAM -- A Metric for Evaluating Alignment in Text-to-Image Generation
P. Grimal
Hervé Le Borgne
Olivier Ferret
Julien Tourille
EGVM
40
10
0
11 Jul 2023
Text Descriptions are Compressive and Invariant Representations for Visual Learning
Zhili Feng
Anna Bair
J. Zico Kolter
VLM
22
6
0
10 Jul 2023
Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity
Zhenlin Xu
Yi Zhu
Tiffany Deng
Abhay Mittal
Yanbei Chen
Manchen Wang
Paolo Favaro
Joseph Tighe
Davide Modolo
VLM
CoGe
13
7
0
28 Jun 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Cheng-Yu Hsieh
Jieyu Zhang
Zixian Ma
Aniruddha Kembhavi
Ranjay Krishna
CoGe
38
115
0
26 Jun 2023
Bring Your Own Data! Self-Supervised Evaluation for Large Language Models
Neel Jain
Khalid Saifullah
Yuxin Wen
John Kirchenbauer
Manli Shu
Aniruddha Saha
Micah Goldblum
Jonas Geiping
Tom Goldstein
ALM
ELM
22
23
0
23 Jun 2023
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurençon
Lucile Saulnier
Léo Tronchon
Stas Bekman
Amanpreet Singh
...
Siddharth Karamcheti
Alexander M. Rush
Douwe Kiela
Matthieu Cord
Victor Sanh
25
230
0
21 Jun 2023
Mass-Producing Failures of Multimodal Systems with Language Models
Shengbang Tong
Erik Jones
Jacob Steinhardt
30
33
0
21 Jun 2023
Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
Isha Rawal
Alexander Matyasko
Shantanu Jaiswal
Basura Fernando
Cheston Tan
16
1
0
15 Jun 2023
Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment
Royi Rassin
Eran Hirsch
Daniel Glickman
Shauli Ravfogel
Yoav Goldberg
Gal Chechik
DiffM
33
100
0
15 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
31
9
0
15 Jun 2023
Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms
Nari Johnson
Ángel Alexander Cabrera
Gregory Plumb
Ameet Talwalkar
26
13
0
13 Jun 2023
Waffling around for Performance: Visual Classification with Random Words and Broad Concepts
Karsten Roth
Jae Myung Kim
A. Sophia Koepke
Oriol Vinyals
Cordelia Schmid
Zeynep Akata
VLM
19
70
0
12 Jun 2023
Aladdin: Zero-Shot Hallucination of Stylized 3D Assets from Abstract Scene Descriptions
Ian Huang
Vrishab Krishna
Omoruyi E. Atekha
Leonidas J. Guibas
DiffM
VGen
22
11
0
09 Jun 2023
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Yonglong Tian
Lijie Fan
Phillip Isola
Huiwen Chang
Dilip Krishnan
VLM
DiffM
11
139
0
01 Jun 2023
RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment
Guian Fang
Zutao Jiang
Jianhua Han
Guangsong Lu
Hang Xu
Shengcai Liao
Xiaodan Liang
EGVM
19
1
0
31 May 2023
Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models
Sivan Doveh
Assaf Arbelle
Sivan Harary
Roei Herzig
Donghyun Kim
...
Rameswar Panda
Raja Giryes
Rogerio Feris
S. Ullman
Leonid Karlinsky
VLM
CoGe
31
52
0
31 May 2023
Compositional diversity in visual concept learning
Yanli Zhou
Reuben Feinman
Brenden Lake
CoGe
OCL
24
8
0
30 May 2023
Scalable Performance Analysis for Vision-Language Models
Santiago Castro
Oana Ignat
Rada Mihalcea
VLM
19
1
0
30 May 2023
On Evaluating Adversarial Robustness of Large Vision-Language Models
Yunqing Zhao
Tianyu Pang
Chao Du
Xiao Yang
Chongxuan Li
Ngai-man Cheung
Min-Bin Lin
VLM
AAML
MLLM
14
166
0
26 May 2023
Are Diffusion Models Vision-And-Language Reasoners?
Benno Krojer
Elinor Poole-Dayan
Vikram S. Voleti
Christopher Pal
Siva Reddy
34
12
0
25 May 2023