Multi-modal Large Language Models (MLLMs) have advanced greatly in general tasks. However, they still face challenges in geometric reasoning, a task that requires synergistic integration of visual recognition proficiency and complex reasoning strength. Existing MLLMs prioritize optimizing the LLM backbone to enhance problem-solving capabilities, while rarely emphasizing improvements in discerning visual elements. However, we reveal that MLLMs suffer from severe visual perception deficiencies, including inaccurate geometric comprehension and severe visual hallucinations, which constrain their reasoning performance. To address this issue, we revisit geometric reasoning through a visual-centric lens that highlights the role of visual perception. To achieve this, we propose EAGLE, a novel coarse-to-fine visual enhancement framework that progressively leverages LLMs' guidance to improve perception proficiency. Specifically, given the substantial disparity between geometric diagrams and natural images, we first introduce Geometric Knowledge Injection. This process explores fundamental knowledge from diagram-caption data to enhance recognition capabilities and improve geometry-language alignments. Then, recognizing that different elements contribute unequally in the reasoning process, we introduce Geometric Knowledge Refinement. This stage leverages LLM-driven chain-of-thought solutions to guide the vision encoder in adaptively prioritizing key elements, fostering a synergistic interplay between visual comprehension and mathematical reasoning. Finally, we develop EAGLE, a geometry expert with strong perception and reasoning capabilities. Extensive experiments demonstrate its effectiveness on three popular benchmarks.

View on arXiv

Comments on this paper