Large Language Models Are Struggle to Cope with Unreasonability in Math Problems

28 March 2024

Abstract

Recent research has demonstrated LLMs' impressive performance in math and reasoning. However, the capacity of LLMs to address math problems under unconventional conditions, such as internal inconsistencies and flawed assumptions, remains largely unexplored. In this paper, we propose a novel benchmark, Unreasonable Math Problem (UMP), designed to assess LLMs' ability to recognize and respond to unreasonability in math problems. The benchmark consists of a carefully curated collection of unreasonable math questions across diverse types. Based on extensive experiments covering 19 LLMs, we observe that even state-of-the-art models such as GPT-4o achieve only limited performance of 0.6 on UMP, while reasoning models such as DeepSeek-R1 are prone to overthinking and instability. We further explore strategies for improving the recognition of unreasonable inputs, shedding light on both the potential and the limitations of LLMs in this challenging setting.

@article{ma2025_2403.19346,
  title={Large Language Models Are Struggle to Cope with Unreasonability in Math Problems},
  author={Jingyuan Ma and Damai Dai and Zihang Yuan and Rui Li and Weilin Luo and Bin Wang and Qun Liu and Lei Sha and Zhifang Sui},
  journal={arXiv preprint arXiv:2403.19346},
  year={2025}
}
