Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

21 March 2025
Jianing Qi
Jiawei Liu
Hao Tang
Zhigang Zhu
Abstract

Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning, such as accurately understanding the relative positions of objects. Inspired by the dual-pathway (ventral-dorsal) model of human vision, we investigate why VLMs fail at spatial tasks despite strong object recognition capabilities. Our interpretability-driven analysis reveals a critical underlying cause: vision embeddings in VLMs are treated primarily as a semantic "bag-of-tokens," overshadowing subtle yet crucial positional cues due to their disproportionately large embedding norms. We validate this insight through extensive diagnostic experiments, demonstrating minimal performance impact when token order or fine-grained spatial details are removed. Guided by these findings, we propose simple, interpretable interventions, including normalizing vision embedding norms and extracting mid-layer spatially rich features, to restore spatial awareness. Empirical results on both our synthetic data and standard benchmarks demonstrate improved spatial reasoning capabilities, highlighting the value of interpretability-informed design choices. Our study not only uncovers fundamental limitations in current VLM architectures but also provides actionable insights for enhancing structured perception of visual scenes.
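The abstract describes the two interventions only at a high level. As a rough illustration of what they could look like, the PyTorch sketch below rescales per-token embedding norms so large-norm semantic tokens no longer drown out positional cues, and pulls hidden states from an intermediate encoder layer via a forward hook. This is a minimal sketch under stated assumptions, not the authors' implementation: the encoder attribute `layers`, the choice of `layer_idx`, and the mean-norm target are all hypothetical.

```python
import torch


def normalize_vision_embeddings(vision_tokens: torch.Tensor,
                                target_norm=None) -> torch.Tensor:
    """Rescale each vision token to a common L2 norm.

    vision_tokens: (batch, num_tokens, dim) embeddings from the vision encoder.
    target_norm: scalar norm to rescale to; defaults to the mean token norm,
    an assumed (not paper-specified) choice.
    """
    norms = vision_tokens.norm(dim=-1, keepdim=True)  # (B, N, 1)
    if target_norm is None:
        target_norm = norms.mean()
    return vision_tokens * (target_norm / (norms + 1e-6))


def extract_mid_layer_features(vision_encoder: torch.nn.Module,
                               pixel_values: torch.Tensor,
                               layer_idx: int = 12) -> torch.Tensor:
    """Capture hidden states from an intermediate encoder block.

    Assumes a ViT-style encoder exposing its blocks as `vision_encoder.layers`;
    real models name this differently, so adapt the attribute accordingly.
    """
    captured = {}

    def hook(_module, _inputs, output):
        captured["feats"] = output  # (B, N, dim) mid-layer features

    handle = vision_encoder.layers[layer_idx].register_forward_hook(hook)
    try:
        vision_encoder(pixel_values)
    finally:
        handle.remove()  # always detach the hook, even on error
    return captured["feats"]
```

In this reading, the normalized or mid-layer features would replace (or be mixed with) the final-layer vision embeddings fed to the language model; how the paper combines them is not specified on this page.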

@article{qi2025_2503.17349,
  title={Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models},
  author={Jianing Qi and Jiawei Liu and Hao Tang and Zhigang Zhu},
  journal={arXiv preprint arXiv:2503.17349},
  year={2025}
}