Can Large Vision Language Models Read Maps Like a Human?

18 March 2025

Abstract

In this paper, we introduce MapBench, the first dataset specifically designed for outdoor navigation on human-readable, pixel-based maps, curated from complex path-finding scenarios. MapBench comprises more than 1,600 pixel-space path-finding problems drawn from 100 diverse maps. In MapBench, LVLMs generate language-based navigation instructions given a map image and a query with beginning and end landmarks. For each map, MapBench provides a Map Space Scene Graph (MSSG) as an indexing data structure, used to convert to and from natural language and to evaluate LVLM-generated results. We demonstrate that MapBench significantly challenges state-of-the-art LVLMs under both zero-shot prompting and a Chain-of-Thought (CoT) augmented reasoning framework that decomposes map navigation into sequential cognitive processes. Our evaluation of both open-source and closed-source LVLMs underscores the substantial difficulty posed by MapBench, revealing critical limitations in their spatial reasoning and structured decision-making capabilities. We release all the code and the dataset at this https URL.
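To make the idea of a graph-based index over a map more concrete, here is a minimal sketch in Python. It is not MapBench's actual MSSG format, which the abstract does not specify. The landmark names, the adjacency-list layout, and the helpers `shortest_route` and `route_is_valid` are all hypothetical, chosen only to illustrate how a graph of landmarks and intersections could supply a reference route and check a candidate route between two landmarks.

```python
from collections import deque

# Hypothetical toy graph: nodes are landmarks or intersections, and edges
# are walkable segments. This layout is an illustrative assumption, not
# the MSSG schema used by MapBench.
GRAPH = {
    "Main Gate": ["Junction A"],
    "Junction A": ["Main Gate", "Fountain", "Junction B"],
    "Fountain": ["Junction A"],
    "Junction B": ["Junction A", "Cafe"],
    "Cafe": ["Junction B"],
}


def shortest_route(graph, start, goal):
    """Breadth-first search: return the route with the fewest edges, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


def route_is_valid(graph, route, start, goal):
    """Check that a route has the right endpoints and only uses existing edges."""
    if not route or route[0] != start or route[-1] != goal:
        return False
    return all(b in graph.get(a, []) for a, b in zip(route, route[1:]))


reference = shortest_route(GRAPH, "Main Gate", "Cafe")
print(reference)
print(route_is_valid(GRAPH, ["Main Gate", "Fountain", "Cafe"], "Main Gate", "Cafe"))
```

In this toy setup, a route extracted from model-generated instructions would be accepted only if every consecutive pair of nodes is connected in the graph. An actual evaluation would presumably also need to map free-form language onto graph nodes, which this sketch leaves out.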

@article{xing2025_2503.14607,
  title={Can Large Vision Language Models Read Maps Like a Human?},
  author={Shuo Xing and Zezhou Sun and Shuangyu Xie and Kaiyuan Chen and Yanjia Huang and Yuping Wang and Jiachen Li and Dezhen Song and Zhengzhong Tu},
  journal={arXiv preprint arXiv:2503.14607},
  year={2025}
}