Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas

3 March 2025
Shiqi Chen
Tongyao Zhu
Ruochen Zhou
Jinghan Zhang
Siyang Gao
Juan Carlos Niebles
Mor Geva
Junxian He
Jiajun Wu
Manling Li
Abstract

Large Vision Language Models (VLMs) have long struggled with spatial reasoning tasks. Surprisingly, even simple spatial reasoning tasks, such as recognizing "under" or "behind" relationships between only two objects, pose significant challenges for current VLMs. In this work, we study the spatial reasoning challenge through the lens of mechanistic interpretability, diving into the model's internal states to examine the interactions between image and text tokens. By tracing the attention distribution over the image throughout intermediate layers, we observe that successful spatial reasoning correlates strongly with the model's ability to align its attention distribution with actual object locations, with notable differences between familiar and unfamiliar spatial relationships. Motivated by these findings, we propose ADAPTVIS, which uses inference-time confidence scores to sharpen the attention on highly relevant regions when the model is confident, while smoothing and broadening the attention window to consider a wider context when confidence is lower. This training-free decoding method shows significant improvement (e.g., up to a 50-point absolute improvement) on spatial reasoning benchmarks such as WhatsUp and VSR with negligible cost. We make code and data publicly available for research purposes at this https URL.
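The confidence-gated rescaling described in the abstract can be illustrated with a minimal sketch. The function below is a hypothetical illustration, not the authors' released implementation: the confidence threshold and the scaling factors (alpha_sharpen, alpha_smooth) are assumed placeholder values, and the attention logits are treated as a generic tensor over image tokens.

import torch

def confidence_gated_attention(attn_logits: torch.Tensor,
                               confidence: float,
                               threshold: float = 0.5,
                               alpha_sharpen: float = 1.5,
                               alpha_smooth: float = 0.5) -> torch.Tensor:
    # Sketch only: threshold and alpha values are illustrative assumptions,
    # not settings from the paper. When the model is confident, the logits
    # are scaled by a factor > 1, which sharpens the softmax over image
    # tokens; when confidence is low, a factor < 1 flattens the distribution
    # so attention spreads over a wider spatial context.
    alpha = alpha_sharpen if confidence >= threshold else alpha_smooth
    return torch.softmax(attn_logits * alpha, dim=-1)

# Example: attention logits over 16 image tokens from one head (random demo data).
logits = torch.randn(16)
sharp = confidence_gated_attention(logits, confidence=0.9)  # peaked distribution
broad = confidence_gated_attention(logits, confidence=0.2)  # flattened distribution

Scaling the logits before the softmax is one simple way to trade off focus against coverage; the key idea from the abstract is that the choice is driven by an inference-time confidence signal rather than a fixed temperature.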

View on arXiv
@article{chen2025_2503.01773,
  title={Why Is Spatial Reasoning Hard for VLMs? An Attention Mechanism Perspective on Focus Areas},
  author={Shiqi Chen and Tongyao Zhu and Ruochen Zhou and Jinghan Zhang and Siyang Gao and Juan Carlos Niebles and Mor Geva and Junxian He and Jiajun Wu and Manling Li},
  journal={arXiv preprint arXiv:2503.01773},
  year={2025}
}