
Representational and Behavioral Stability of Truth in Large Language Models

Main: 8 pages · Bibliography: 2 pages · Appendix: 15 pages · 28 figures · 9 tables
Abstract

Large language models (LLMs) are increasingly used as information sources, yet small changes in semantic framing can destabilize their truth judgments. We propose P-StaT (Perturbation Stability of Truth), an evaluation framework for testing belief stability under controlled semantic perturbations in representational and behavioral settings via probing and zero-shot prompting. Across sixteen open-source LLMs and three domains, we compare perturbations involving epistemically familiar Neither statements drawn from well-known fictional contexts (Fictional) to those involving unfamiliar Neither statements not seen in training data (Synthetic). We find a consistent stability hierarchy: Synthetic content aligns closely with factual representations and induces the largest retractions of previously held beliefs, producing up to 32.7% retractions in representational evaluations and up to 36.3% in behavioral evaluations. By contrast, Fictional content is more representationally distinct and comparatively stable. Together, these results suggest that epistemic familiarity is a robust signal across instantiations of belief stability under semantic reframing, complementing accuracy-based factuality evaluation with a notion of epistemic robustness.
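As a rough illustration of the kind of metric the abstract reports (not the paper's exact definition), a retraction rate can be read as the fraction of statements a model initially judges true but no longer judges true after a semantic perturbation. The sketch below assumes binary truth judgments collected before and after perturbation; the function name and data layout are hypothetical.

```python
# Minimal sketch of a retraction-rate metric under perturbation.
# Assumes binary truth judgments before/after; names are hypothetical.
from typing import Sequence

def retraction_rate(before: Sequence[bool], after: Sequence[bool]) -> float:
    """Fraction of items judged true before perturbation but not after."""
    assert len(before) == len(after)
    initially_true = [i for i, b in enumerate(before) if b]
    if not initially_true:
        return 0.0
    retracted = sum(1 for i in initially_true if not after[i])
    return retracted / len(initially_true)

# Example: 2 of 4 initially-true judgments flip after perturbation -> 0.5
print(retraction_rate([True, True, True, True, False],
                      [True, False, False, True, False]))
```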
