Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring

10 February 2025
Alex Heyman
Joel Zylberberg
    LRM
Abstract

Contemporary large language models are powerful problem-solving tools, but they exhibit weaknesses in their reasoning abilities which ongoing research seeks to mitigate. We investigate graph coloring as a means of evaluating an LLM's capacities for systematic step-by-step reasoning and possibility space exploration, as well as effects of semantic problem framing. We test Claude 3.5 Sonnet, Llama 3.1 405B, Gemini 1.5 Pro, GPT-4o, o1-mini, and DeepSeek-R1 on a dataset of k-coloring problems with 2 ≤ k ≤ 4 and vertex count 4 ≤ n ≤ 8, using partial algorithmic solvers to further categorize problems by difficulty. In addition to substantial but varying framing effects, we find that all models except o1-mini and R1 exhibit >60% error rates on difficult problem types in all frames (>15% for o1-mini and >10% for R1), and no model achieves perfect accuracy even in the simple domain of 2-coloring 4-vertex graphs. Our results highlight both the considerable recent progress in LLM systematic reasoning and the limits of its reliability, especially in relation to increasing computational costs. We expect that more complex graph coloring problems, and procedural generation of arbitrary-complexity reasoning problems more broadly, offer further untapped potential for LLM benchmarking.
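
For a concrete picture of the benchmark task, the sketch below shows how k-coloring instances in the abstract's parameter range (2 ≤ k ≤ 4, 4 ≤ n ≤ 8) can be procedurally generated and verified with a backtracking solver. This is not the authors' code: the Erdős–Rényi-style generator with edge probability p, and all function names, are illustrative assumptions, not details taken from the paper.

# Minimal sketch (assumed, not the authors' implementation): generate a
# random graph and decide k-colorability by backtracking search.
import itertools
import random

def random_graph(n, p=0.5, seed=None):
    """Sample an Erdos-Renyi-style edge list on vertices 0..n-1
    (the paper's exact generation procedure is not given in the abstract)."""
    rng = random.Random(seed)
    return [(u, v) for u, v in itertools.combinations(range(n), 2)
            if rng.random() < p]

def is_k_colorable(n, edges, k):
    """Return True iff the graph admits a proper k-coloring,
    found by assigning colors to vertices 0..n-1 with backtracking."""
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    colors = [-1] * n  # -1 marks an uncolored vertex

    def assign(v):
        if v == n:
            return True  # every vertex colored without conflict
        for c in range(k):
            # c is legal for v if no already-colored neighbor uses it
            if all(colors[u] != c for u in adj[v]):
                colors[v] = c
                if assign(v + 1):
                    return True
        colors[v] = -1  # backtrack
        return False

    return assign(0)

if __name__ == "__main__":
    edges = random_graph(n=6, p=0.5, seed=0)
    print(edges, "3-colorable:", is_k_colorable(6, edges, k=3))

At these sizes (n ≤ 8) the backtracking solver is effectively instantaneous, which is what makes exact ground-truth labels, and hence reliable LLM scoring, cheap to obtain.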

View on arXiv
@article{heyman2025_2502.07087,
  title={Evaluating the Systematic Reasoning Abilities of Large Language Models through Graph Coloring},
  author={Alex Heyman and Joel Zylberberg},
  journal={arXiv preprint arXiv:2502.07087},
  year={2025}
}