Large Language Models (LLMs) are increasingly applied to Math Word Problems (MWPs),
transforming how these problems are approached and solved across domains,
including educational settings.
However, the evaluation of these models often prioritizes final accuracy,
overlooking the crucial aspect of reasoning capabilities. This work addresses
this gap by focusing on the ability of LLMs to detect and correct reasoning
mistakes. We introduce a novel dataset MWP-MISTAKE, incorporating MWPs with
both correct and incorrect reasoning steps generated through rule-based methods
and smaller language models. Our comprehensive benchmarking reveals significant
insights into the strengths and weaknesses of state-of-the-art models, such as
GPT-4o, GPT-4, GPT-3.5-Turbo, and others. We highlight GPT-4o's superior performance in mistake detection and rectification and the persistent challenges faced by smaller models. Additionally, we identify issues related to data contamination and memorization, which impact the reliability of LLMs in real-world applications. Our findings emphasize the importance of rigorously evaluating reasoning processes and propose future directions to enhance the generalization and robustness of LLMs in mathematical problem-solving.