ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2210.01936
  4. Cited By
When and why vision-language models behave like bags-of-words, and what
  to do about it?

When and why vision-language models behave like bags-of-words, and what to do about it?

4 October 2022
Mert Yuksekgonul
Federico Bianchi
Pratyusha Kalluri
Dan Jurafsky
James Y. Zou
    VLM
    CoGe
ArXivPDFHTML

Papers citing "When and why vision-language models behave like bags-of-words, and what to do about it?"

50 / 285 papers shown
Title
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Visually Guided Decoding: Gradient-Free Hard Prompt Inversion with Language Models
Donghoon Kim
Minji Bae
Kyuhong Shim
B. Shim
26
0
0
13 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
36
0
0
04 May 2025
Multi-Modal Language Models as Text-to-Image Model Evaluators
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Reyhane Askari Hemmat
Koustuv Sinha
Melissa Hall
M. Drozdzal
Adriana Romero-Soriano
EGVM
60
0
0
01 May 2025
The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning
The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning
Siyi Chen
Yimeng Zhang
Sijia Liu
Q. Qu
AAML
88
0
0
30 Apr 2025
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Breaking the Modality Barrier: Universal Embedding Learning with Multimodal LLMs
Tiancheng Gu
Kaicheng Yang
Ziyong Feng
Xingjun Wang
Yanzhao Zhang
Dingkun Long
Yingda Chen
Weidong Cai
Jiankang Deng
VLM
114
0
0
24 Apr 2025
Text-to-Image Alignment in Denoising-Based Models through Step Selection
Text-to-Image Alignment in Denoising-Based Models through Step Selection
P. Grimal
Hervé Le Borgne
Olivier Ferret
DiffM
EGVM
48
0
0
24 Apr 2025
Decoupled Global-Local Alignment for Improving Compositional Understanding
Decoupled Global-Local Alignment for Improving Compositional Understanding
Xiaoxing Hu
Kaicheng Yang
J. Z. Wang
Haoran Xu
Ziyong Feng
Y. Wang
VLM
89
0
0
23 Apr 2025
MIEB: Massive Image Embedding Benchmark
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao
Isaac Chung
Imene Kerboua
Jamie Stirling
Xin Zhang
Márton Kardos
Roman Solomatin
Noura Al Moubayed
K. Enevoldsen
Niklas Muennighoff
VLM
35
0
0
14 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
X. Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
26
1
0
14 Apr 2025
Don't Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs
Don't Deceive Me: Mitigating Gaslighting through Attention Reallocation in LMMs
Pengkun Jiao
Bin Zhu
Jingjing Chen
Chong-Wah Ngo
Yu Jiang
38
0
0
13 Apr 2025
Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict
Mixed Signals: Decoding VLMs' Reasoning and Underlying Bias in Vision-Language Conflict
Pouya Pezeshkpour
Moin Aminnaseri
Estevam R. Hruschka
27
0
0
11 Apr 2025
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
Have we unified image generation and understanding yet? An empirical study of GPT-4o's image generation ability
Ning Li
Jingran Zhang
Justin Cui
MLLM
70
1
0
09 Apr 2025
Human-like compositional learning of visually-grounded concepts using synthetic environments
Human-like compositional learning of visually-grounded concepts using synthetic environments
Zijun Lin
M Ganesh Kumar
Cheston Tan
OCL
CoGe
70
0
0
09 Apr 2025
Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data
Enhancing Compositional Reasoning in Vision-Language Models with Synthetic Preference Data
Samarth Mishra
Kate Saenko
Venkatesh Saligrama
CoGe
LRM
37
0
0
07 Apr 2025
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
REVEAL: Relation-based Video Representation Learning for Video-Question-Answering
Sofian Chaybouti
Walid Bousselham
Moritz Wolter
Hilde Kuehne
71
0
0
07 Apr 2025
Taxonomy-Aware Evaluation of Vision-Language Models
Taxonomy-Aware Evaluation of Vision-Language Models
Vésteinn Snæbjarnarson
Kevin Du
Niklas Stoehr
Serge J. Belongie
Ryan Cotterell
Nico Lang
Stella Frank
27
0
0
07 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
36
0
0
04 Apr 2025
Concept Lancet: Image Editing with Compositional Representation Transplant
Concept Lancet: Image Editing with Compositional Representation Transplant
Jinqi Luo
Tianjiao Ding
Kwan Ho Ryan Chan
Hancheng Min
Chris Callison-Burch
René Vidal
DiffM
KELM
72
0
0
03 Apr 2025
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis
J. Huang
Baoxiong Jia
Y. Wang
Ziyu Zhu
Xiongkun Linghu
Qing Li
Song-Chun Zhu
Siyuan Huang
75
3
0
28 Mar 2025
Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
Spatial Transport Optimization by Repositioning Attention Map for Training-Free Text-to-Image Synthesis
Woojung Han
Yeonkyung Lee
Chanyoung Kim
Kwanghyun Park
Seong Jae Hwang
DiffM
60
0
0
28 Mar 2025
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
89
0
0
27 Mar 2025
On Large Multimodal Models as Open-World Image Classifiers
On Large Multimodal Models as Open-World Image Classifiers
Alessandro Conti
Massimiliano Mancini
Enrico Fini
Yiming Wang
Paolo Rota
Elisa Ricci
VLM
Presented at ResearchTrend Connect | VLM on 07 May 2025
82
0
0
27 Mar 2025
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
Zitian Wang
Yue Liao
Kang Rong
Fengyun Rao
Yibo Yang
Si Liu
70
0
0
26 Mar 2025
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Jinjin Zhang
Guodong Wang
Yizhou Jin
Di Huang
42
1
0
24 Mar 2025
On the Perception Bottleneck of VLMs for Chart Understanding
On the Perception Bottleneck of VLMs for Chart Understanding
Junteng Liu
Weihao Zeng
Xiwen Zhang
Yijun Wang
Zifei Shan
Junxian He
60
0
0
24 Mar 2025
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Training-Free Personalization via Retrieval and Reasoning on Fingerprints
Deepayan Das
Davide Talon
Yiming Wang
Massimiliano Mancini
Elisa Ricci
VLM
LRM
42
0
0
24 Mar 2025
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Instruct-CLIP: Improving Instruction-Guided Image Editing with Automated Data Refinement Using Contrastive Learning
Sherry X Chen
Misha Sra
Pradeep Sen
50
0
0
24 Mar 2025
Can Text-to-Video Generation help Video-Language Alignment?
Can Text-to-Video Generation help Video-Language Alignment?
Luca Zanella
Massimiliano Mancini
Willi Menapace
Sergey Tulyakov
Yiming Wang
Elisa Ricci
DiffM
VGen
55
0
0
24 Mar 2025
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models
Davide Berasi
Matteo Farina
Massimiliano Mancini
Elisa Ricci
Nicola Strisciuglio
CoGe
66
0
0
21 Mar 2025
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models
Jianing Qi
Jiawei Liu
Hao Tang
Zhigang Zhu
101
1
0
21 Mar 2025
Dynamic Relation Inference via Verb Embeddings
Dynamic Relation Inference via Verb Embeddings
Omri Suissa
Muhiim Ali
Ariana Azarbal
Hui Shen
Shekhar Pradhan
41
0
0
17 Mar 2025
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection
Chuhan Zhang
Chaoyang Zhu
Pingcheng Dong
Long Chen
Dong Zhang
ObjD
VLM
108
0
0
14 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
53
0
0
10 Mar 2025
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation
Yuwei Niu
Munan Ning
Mengren Zheng
Bin Lin
Peng Jin
Jiaqi Liao
Kunpeng Ning
Bin Zhu
Li Yuan
EGVM
55
10
0
10 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li
Boyang Li
CoGe
69
0
0
03 Mar 2025
LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding
LIFT-GS: Cross-Scene Render-Supervised Distillation for 3D Language Grounding
Ang Cao
Sergio Arnaud
Oleksandr Maksymets
Jianing Yang
Ayush Jain
...
Aravind Rajeswaran
Franziska Meier
Justin Johnson
Jeong Joon Park
Alexander Sax
63
0
0
27 Feb 2025
Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study
Reza Abbasi
Ali Nazari
Aminreza Sefid
Mohammadali Banayeeanzade
M. Rohban
M. Baghshah
VLM
56
1
0
27 Feb 2025
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation
Reza Abbasi
Ali Nazari
Aminreza Sefid
Mohammadali Banayeeanzade
M. Rohban
M. Baghshah
VLM
73
1
0
27 Feb 2025
Investigating and Enhancing Vision-Audio Capability in Omnimodal Large Language Models
Rui Hu
Delai Qiu
Shuyu Wei
J. Zhang
Yining Wang
Shengping Liu
Jitao Sang
AuLLM
VLM
43
0
0
27 Feb 2025
Contrastive Visual Data Augmentation
Contrastive Visual Data Augmentation
Yu Zhou
B. Li
Mohan Tang
Xiaomeng Jin
Te-Lin Wu
Kuan-Hao Huang
Heng Ji
Kai-Wei Chang
Nanyun Peng
57
0
0
24 Feb 2025
Can Hallucination Correction Improve Video-Language Alignment?
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
57
0
0
20 Feb 2025
Object-centric Binding in Contrastive Language-Image Pretraining
Object-centric Binding in Contrastive Language-Image Pretraining
Rim Assouel
Pietro Astolfi
Florian Bordes
M. Drozdzal
Adriana Romero Soriano
OCL
VLM
CoGe
103
0
0
19 Feb 2025
Image Embedding Sampling Method for Diverse Captioning
Image Embedding Sampling Method for Diverse Captioning
Sania Waheed
Na Min An
55
0
0
14 Feb 2025
From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models
From No to Know: Taxonomy, Challenges, and Opportunities for Negation Understanding in Multimodal Foundation Models
Mayank Vatsa
Aparna Bharati
S. Mittal
Richa Singh
53
0
0
10 Feb 2025
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
RLS3: RL-Based Synthetic Sample Selection to Enhance Spatial Reasoning in Vision-Language Models for Indoor Autonomous Perception
Joshua R. Waite
Md Zahid Hasan
Qisai Liu
Zhanhong Jiang
Chinmay Hegde
S. Sarkar
OffRL
SyDa
158
1
0
31 Jan 2025
Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation
Bin Zhu
Hui yan Qi
Yinxuan Gui
Jingjing Chen
Chong-Wah Ngo
Ee-Peng Lim
83
1
0
31 Jan 2025
MASS: Overcoming Language Bias in Image-Text Matching
MASS: Overcoming Language Bias in Image-Text Matching
Jiwan Chung
Seungwon Lim
Sangkyu Lee
Youngjae Yu
VLM
30
0
0
20 Jan 2025
Know "No'' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Know "No'' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
J. Park
Jungbeom Lee
Jongyoon Song
Sangwon Yu
Dahuin Jung
Sungroh Yoon
45
0
0
19 Jan 2025
Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models
Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models
Jeonghwan Kim
Heng Ji
MLLM
33
2
0
08 Jan 2025
Multi-Agent Planning Using Visual Language Models
Multi-Agent Planning Using Visual Language Models
Michele Brienza
F. Argenziano
Vincenzo Suriani
D. Bloisi
Daniele Nardi
LM&Ro
LLMAG
62
4
0
31 Dec 2024
123456
Next