v1v2 (latest)

Vision-language models lag human performance on physical dynamics and intent reasoning

4 January 2026

Tianjun Gu

Jingyu Gong

Zhizhong Zhang

Yuan Xie

Lizhuang Ma

Xin Tan

Athanasios V

LRM

ArXiv (abs)PDF HTML Github

Main:8 Pages

8 Figures

Bibliography:2 Pages

3 Tables

Appendix:4 Pages

Abstract

Spatial intelligence is central to embodied cognition, yet contemporary AI systems still struggle to reason about physical interactions in open-world human environments. Despite strong performance on controlled benchmarks, vision-language models often fail to jointly model physical dynamics, reference frames, and the latent human intentions that drive spatial change. We introduce Teleo-Spatial Intelligence (TSI), a reasoning capability that links spatiotemporal change to goal-directed structure. To evaluate TSI, we present EscherVerse, a large-scale open-world resource built from 11,328 real-world videos, including an 8,000-example benchmark and a 35,963-example instruction-tuning set. Across 27 state-of-the-art vision-language models and an independent analysis of first-pass human responses from 11 annotators, we identify a persistent teleo-spatial reasoning gap: the strongest proprietary model achieves 57.26\% overall accuracy, far below first-pass human performance, which ranges from 84.81\% to 95.14\% with a mean of 90.62\%. Fine-tuning on real-world, intent-aware data narrows this gap for open-weight models, but does not close it. EscherVerse provides a diagnostic testbed for purpose-aware spatial reasoning and highlights a critical gap between pattern recognition and human-level understanding in embodied AI.

View on arXiv

Comments on this paper