DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models

13 May 2025
Xiaoyang Chen
Xinan Dai
Yu Du
Qian Feng
Naixu Guo
Tingshuo Gu
Yuting Gao
Yingyi Gao
Xudong Han
Xiang Jiang
Yilin Jin
Hongyi Lin
Shisheng Lin
Xiangnan Li
Yuante Li
Yixing Li
Zhentao Lai
Zilu Ma
Yingrong Peng
Jiacheng Qian
Hao-Yu Sun
Jianbo Sun
Zirui Wang
Siwei Wu
Zian Wang
Bin Xu
Jianghao Xu
Yiyang Yu
Zichuan Yang
Hongji Zha
Ruichong Zhang
Abstract

To advance the mathematical proficiency of large language models (LLMs), the DeepMath team has launched an open-source initiative aimed at developing an open mathematical LLM and systematically evaluating its mathematical creativity. This paper is the initial contribution of that initiative. While recent work on mathematical LLMs has predominantly emphasized reasoning skills, as evidenced by benchmarks of elementary- to undergraduate-level mathematical tasks, the creative capabilities of these models have received comparatively little attention, and evaluation datasets remain scarce. To address this gap, we propose evaluation criteria for mathematical creativity and introduce DeepMath-Creative, a novel, high-quality benchmark comprising constructive problems across algebra, geometry, analysis, and other domains. Using this dataset, we conduct a systematic evaluation of mainstream LLMs' creative problem-solving abilities. Experimental results show that even under lenient scoring criteria, which emphasize core solution components and disregard minor inaccuracies such as small logical gaps, incomplete justifications, or redundant explanations, the best-performing model, o3-mini, achieves merely 70% accuracy, primarily on basic undergraduate-level constructive tasks. Performance declines sharply on more complex problems, with models failing to provide substantive strategies for open problems. These findings suggest that, although current LLMs display a degree of constructive proficiency on familiar and lower-difficulty problems, such performance is likely attributable to the recombination of memorized patterns rather than authentic creative insight or novel synthesis.
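The abstract describes a lenient scoring protocol that credits core solution components while disregarding minor gaps, but the paper's actual rubric is not reproduced on this page. Purely as an illustration, the following is a minimal Python sketch of such a scorer under the assumption of a keyword-proxy rubric; the names ConstructiveProblem, core_components, and lenient_score are hypothetical and are not taken from the paper.

from dataclasses import dataclass

@dataclass
class ConstructiveProblem:
    """One benchmark item: the prompt plus keyword proxies for the
    essential steps a correct construction must exhibit.
    (Hypothetical structure; not the paper's actual schema.)"""
    prompt: str
    core_components: list[str]

def lenient_score(solution: str, problem: ConstructiveProblem) -> float:
    """Fraction of core components present in the model's solution.

    Mirrors the abstract's lenient criteria only in spirit: credit is
    given for each core component found, while small logical gaps,
    incomplete justifications, and redundant text incur no penalty.
    """
    text = solution.lower()
    hits = sum(1 for c in problem.core_components if c.lower() in text)
    return hits / len(problem.core_components)

if __name__ == "__main__":
    prob = ConstructiveProblem(
        prompt="Construct a continuous, nowhere differentiable function on [0, 1].",
        core_components=["weierstrass", "uniform convergence", "nowhere differentiable"],
    )
    answer = (
        "Take the Weierstrass function W(x) = sum a^n cos(b^n pi x); "
        "uniform convergence gives continuity, and for ab > 1 + 3pi/2 "
        "it is nowhere differentiable."
    )
    print(f"lenient score: {lenient_score(answer, prob):.2f}")  # prints 1.00

A real grader would need semantic matching, for instance an LLM or human judge, rather than substring checks; the sketch only shows the lenient-credit structure of partial credit per core component with no penalty for redundancy.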

@article{chen2025_2505.08744,
  title={DeepMath-Creative: A Benchmark for Evaluating Mathematical Creativity of Large Language Models},
  author={Xiaoyang Chen and Xinan Dai and Yu Du and Qian Feng and Naixu Guo and Tingshuo Gu and Yuting Gao and Yingyi Gao and Xudong Han and Xiang Jiang and Yilin Jin and Hongyi Lin and Shisheng Lin and Xiangnan Li and Yuante Li and Yixing Li and Zhentao Lai and Zilu Ma and Yingrong Peng and Jiacheng Qian and Hao-Yu Sun and Jianbo Sun and Zirui Wang and Siwei Wu and Zian Wang and Bin Xu and Jianghao Xu and Yiyang Yu and Zichuan Yang and Hongji Zha and Ruichong Zhang},
  journal={arXiv preprint arXiv:2505.08744},
  year={2025}
}