
Evaluation of LLMs for mathematical problem solving

30 May 2025
Ruonan Wang
Runxi Wang
Yunwen Shen
Chengfeng Wu
Qinglin Zhou
Rohitash Chandra
Main: 18 pages · Figures: 5 · Tables: 22 · Bibliography: 4 pages · Appendix: 1 page
Abstract

Large Language Models (LLMs) have shown impressive performance on a range of educational tasks, but their potential to solve mathematical problems remains understudied. In this study, we compare three prominent LLMs (GPT-4o, DeepSeek-V3, and Gemini-2.0) on three mathematics datasets of varying complexity (GSM8K, MATH500, and MIT OpenCourseWare). We take a five-dimensional approach based on the Structured Chain-of-Thought (SCoT) framework, assessing final-answer correctness, step completeness, step validity, intermediate-calculation accuracy, and problem comprehension. The results show that GPT-4o is the most stable and consistent performer across all datasets, and it performs particularly well on high-level questions from the MIT OpenCourseWare dataset. DeepSeek-V3 is competitive in well-structured domains such as optimisation, but suffers from fluctuating accuracy on statistical-inference tasks. Gemini-2.0 shows strong linguistic understanding and clarity on well-structured problems, but performs poorly on multi-step reasoning and symbolic logic. Our error analysis reveals distinct deficits in each model: GPT-4o at times lacks sufficient explanation or precision; DeepSeek-V3 omits intermediate steps; and Gemini-2.0 is less flexible in higher-dimensional mathematical reasoning.
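The five-dimensional SCoT assessment described above can be sketched as a simple aggregation of per-dimension rubric scores. This is a minimal illustrative sketch only: the dimension keys, the 0-1 score scale, and the `aggregate` helper are assumptions for demonstration, not the paper's actual implementation.

```python
from statistics import mean

# The five SCoT evaluation dimensions named in the abstract.
# (Key names are illustrative, not taken from the paper's code.)
DIMENSIONS = [
    "final_answer_correctness",
    "step_completeness",
    "step_validity",
    "intermediate_calculation_accuracy",
    "problem_comprehension",
]

def aggregate(graded_solutions: list[dict[str, float]]) -> dict[str, float]:
    """Average each SCoT dimension over a list of graded solutions.

    Each graded solution is assumed to carry one score in [0, 1]
    per dimension (a hypothetical rubric scale).
    """
    return {d: mean(s[d] for s in graded_solutions) for d in DIMENSIONS}

# Example: two graded solutions from one model on one dataset.
graded = [
    {d: 1.0 for d in DIMENSIONS},  # fully correct solution
    {d: 0.5 for d in DIMENSIONS},  # partially correct solution
]
print(aggregate(graded))  # every dimension averages to 0.75
```

Aggregating per dimension (rather than into a single score) preserves the kind of contrast the abstract reports, e.g. a model that answers correctly while omitting intermediate steps.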

@article{wang2025_2506.00309,
  title={Evaluation of LLMs for mathematical problem solving},
  author={Ruonan Wang and Runxi Wang and Yunwen Shen and Chengfeng Wu and Qinglin Zhou and Rohitash Chandra},
  journal={arXiv preprint arXiv:2506.00309},
  year={2025}
}