GeoBench: Rethinking Multimodal Geometric Problem-Solving via Hierarchical Evaluation

30 December 2025

Yuan Feng

Yue Yang

Xiaohan He

Jiatong Zhao

Jianlong Chen

Zijun Chen

Daocheng Fu

Qi Liu

Renqiu Xia

Bo Zhang

Junchi Yan

LRM

Main: 12 Pages

8 Figures

Bibliography: 3 Pages

13 Tables

Appendix: 9 Pages

Abstract

Geometric problem solving constitutes a critical branch of mathematical reasoning, requiring precise analysis of shapes and spatial relationships. Current evaluations of geometric reasoning in vision-language models (VLMs) face limitations, including the risk of test data contamination from textbook-based benchmarks, overemphasis on final answers over reasoning processes, and insufficient diagnostic granularity. To address these issues, we present GeoBench, a hierarchical benchmark featuring four reasoning levels in geometric problem-solving: Visual Perception, Goal-Oriented Planning, Rigorous Theorem Application, and Self-Reflective Backtracking. Through six formally verified tasks generated via TrustGeoGen, we systematically assess capabilities ranging from attribute extraction to logical error correction. Experiments reveal that while reasoning models like OpenAI-o3 outperform general multimodal large language models (MLLMs), performance declines significantly with increasing task complexity. Key findings demonstrate that sub-goal decomposition and irrelevant premise filtering critically influence final problem-solving accuracy, whereas Chain-of-Thought prompting unexpectedly degrades performance in some tasks. These findings establish GeoBench as a comprehensive benchmark while offering actionable guidelines for developing geometric problem-solving systems.
