Vision Language Models See What You Want but not What You See

Qingying Gao
Yijiang Li
Haiyun Lyu
Haoran Sun
Dezhi Luo
Hokin Deng
Abstract

Knowing others' intentions and taking others' perspectives are two core components of human intelligence that are considered instantiations of theory-of-mind. Endowing machines with these abilities is an important step toward building human-level artificial intelligence. Here, to investigate intentionality understanding and level-2 perspective-taking in Vision Language Models (VLMs), we constructed IntentBench and PerspectBench, which together contain over 300 cognitive experiments grounded in real-world scenarios and classic cognitive tasks. We found that VLMs achieve high performance on intentionality understanding but low performance on level-2 perspective-taking. This suggests a potential dissociation between simulation-based and theory-based theory-of-mind abilities in VLMs, raising the concern that they are not capable of using model-based reasoning to infer others' mental states. See \href{this https URL}{Website}.

@article{gao2025_2410.00324,
  title={Vision Language Models See What You Want but not What You See},
  author={Qingying Gao and Yijiang Li and Haiyun Lyu and Haoran Sun and Dezhi Luo and Hokin Deng},
  journal={arXiv preprint arXiv:2410.00324},
  year={2025}
}