Vision Language Models See What You Want but not What You See

Qingying Gao
Yijiang Li
Haiyun Lyu
Haoran Sun
Dezhi Luo
Hokin Deng
Abstract

Knowing others' intentions and taking others' perspectives are two core components of human intelligence that are considered instantiations of theory-of-mind. Endowing machines with these abilities is an important step toward building human-level artificial intelligence. Here, to investigate intentionality understanding and level-2 perspective-taking in Vision Language Models (VLMs), we constructed IntentBench and PerspectBench, which together contain over 300 cognitive experiments grounded in real-world scenarios and classic cognitive tasks. We found that VLMs achieve high performance on intentionality understanding but low performance on level-2 perspective-taking. This suggests a potential dissociation between simulation-based and theory-based theory-of-mind abilities in VLMs, raising the concern that they are not capable of using model-based reasoning to infer others' mental states. See \href{this https URL}{Website}.

@article{gao2025_2410.00324,
  title={Vision Language Models See What You Want but not What You See},
  author={Qingying Gao and Yijiang Li and Haiyun Lyu and Haoran Sun and Dezhi Luo and Hokin Deng},
  journal={arXiv preprint arXiv:2410.00324},
  year={2025}
}