Large Language Models' Reasoning Stalls: An Investigation into the Capabilities of Frontier Models

26 May 2025

Lachlan McGinness

Main: 11 pages; Bibliography: 5 pages; Appendix: 6 pages; 8 figures; 3 tables

Abstract

We study empirical methods for examining the capability of Large Language Models (LLMs) to use Automated Theorem Prover (ATP) reasoning strategies. We evaluate the performance of state-of-the-art models from December 2023 and August 2024 on PRONTOQA steamroller reasoning problems. To do so, we develop methods for assessing LLM response accuracy and correct answer correlation.
