WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking
Main: 9 pages · Appendix: 24 pages · Bibliography: 4 pages · 7 figures · 13 tables
Abstract
Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks. Two scenarios can produce this: (i) the input is genuinely unverifiable, but the model cannot explain why; and (ii) the problem is verifiable, but the model fails to solve it and therefore outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest, rather than analyzing the limits of LLM reasoning.
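The two-case taxonomy above can be sketched as a simple post-hoc classifier over model outputs. The function name, the enum labels, and the convention that gold labels marked Unknown denote genuinely unverifiable inputs are illustrative assumptions for this sketch, not the paper's implementation:

```python
from enum import Enum
from typing import Optional


class UnknownCase(Enum):
    # Case (i): the input itself is unverifiable
    GENUINELY_UNVERIFIABLE = "input truly unverifiable"
    # Case (ii): verifiable problem the model failed to solve
    VAGUE_PERCEPTION_FAILURE = "verifiable but unsolved"


def classify_unknown(gold_label: str, model_output: str) -> Optional[UnknownCase]:
    """Categorize a model's Unknown answer into the two Vague Perception cases.

    Returns None when the model did not answer Unknown, since the
    taxonomy only applies to Unknown outputs.
    """
    if model_output != "Unknown":
        return None
    if gold_label == "Unknown":
        # Ground truth agrees the sample is unverifiable (case i)
        return UnknownCase.GENUINELY_UNVERIFIABLE
    # Ground truth is a definite label the model failed to reach (case ii)
    return UnknownCase.VAGUE_PERCEPTION_FAILURE
```

Separating the two cases this way is what lets an evaluation distinguish honest uncertainty from a reasoning failure.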