WakenLLM: Evaluating Reasoning Potential and Stability in LLMs via Fine-Grained Benchmarking

Main: 9 pages · Appendix: 24 pages · Bibliography: 4 pages · 7 figures · 13 tables
Abstract

Large Language Models (LLMs) frequently output the label Unknown in reasoning tasks, where two scenarios may arise: (i) the input is genuinely unverifiable, but the model cannot explain why; and (ii) the problem is verifiable, but the model fails to solve it and therefore outputs Unknown. We refer to these cases collectively as the Vague Perception phenomenon. Current evaluations focus on whether such answers are honest rather than analyzing the limits of LLM reasoning.
