
When Small Models Are Right for Wrong Reasons: Process Verification for Trustworthy Agents

Laksh Advani
Main: 4 pages · 2 figures · 3 tables · Bibliography: 1 page · Appendix: 1 page
Abstract

Deploying small language models (7–9B parameters) as autonomous agents requires trust in their reasoning, not just their outputs. We reveal a reliability crisis: 50–69% of correct answers from these models contain fundamentally flawed reasoning, a "Right-for-Wrong-Reasons" phenomenon invisible to standard accuracy metrics. Through analysis of 10,734 reasoning traces across three models and diverse tasks, we introduce the Reasoning Integrity Score (RIS), a process-based metric validated with substantial inter-rater agreement (κ = 0.657). Our findings challenge conventional practice: while retrieval-augmented generation (RAG) significantly improves reasoning integrity (Cohen's d = 0.23–0.93), meta-cognitive interventions such as self-critique often harm the performance of small models on the evaluated tasks (d = −0.14 to −0.33). Mechanistic analysis reveals that RAG succeeds by grounding calculations in external evidence, reducing errors by 7.6%, whereas meta-cognition amplifies confusion when model capacity is insufficient. To enable deployment, we distill verification capabilities into a neural classifier that achieves a 0.86 F1-score with a 100× speedup. These results underscore the necessity of process-based verification for trustworthy agents: accuracy alone is dangerously insufficient when models can be right for entirely wrong reasons.
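The abstract does not give the formal definition of RIS, so the sketch below is only an illustration of the general idea: scoring the reasoning process separately from answer correctness, given human annotations of each trace. The data layout, field names, and the two-annotator agreement check are all hypothetical assumptions for this toy, not the authors' implementation.

# Toy sketch of process-based scoring (assumptions, not the paper's method).
# Each trace carries two hypothetical annotations: whether the final answer
# is correct, and whether the reasoning that produced it is valid.

from sklearn.metrics import cohen_kappa_score

traces = [
    # (answer_correct, reasoning_valid) -- hypothetical annotations
    (True, True), (True, False), (True, False), (False, False),
    (True, True), (False, True), (True, False), (True, True),
]

# A process-based integrity score: fraction of traces whose reasoning holds up.
ris = sum(valid for _, valid in traces) / len(traces)

# "Right-for-Wrong-Reasons" rate: among correct answers, how many rest on
# flawed reasoning -- exactly the failure mode plain accuracy cannot see.
correct = [(c, v) for c, v in traces if c]
rwr_rate = sum(not v for _, v in correct) / len(correct)

print(f"Accuracy:  {sum(c for c, _ in traces) / len(traces):.2f}")
print(f"RIS:       {ris:.2f}")
print(f"RfWR rate: {rwr_rate:.2f}")

# Inter-rater agreement on reasoning validity (two hypothetical annotators),
# the same kind of kappa statistic the paper reports for RIS validation.
rater_a = [1, 0, 0, 0, 1, 1, 0, 1]
rater_b = [1, 0, 1, 0, 1, 1, 0, 1]
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.3f}")

On this toy data, accuracy looks healthy while the RfWR rate exposes correct answers built on invalid reasoning, which is the gap the paper's metric is designed to surface.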
