Representational and Behavioral Stability of Truth in Large Language Models
Large language models (LLMs) are increasingly used as information sources, yet small changes in semantic framing can destabilize their truth judgments. We propose P-StaT (Perturbation Stability of Truth), an evaluation framework for testing belief stability under controlled semantic perturbations in representational and behavioral settings via probing and zero-shot prompting. Across sixteen open-source LLMs and three domains, we compare perturbations involving epistemically familiar Neither statements drawn from well-known fictional contexts (Fictional) to those involving unfamiliar Neither statements not seen in training data (Synthetic). We find a consistent stability hierarchy: Synthetic content aligns closely with factual representations and induces the largest retractions of previously held beliefs in both representational and behavioral evaluations. By contrast, Fictional content is more representationally distinct and comparatively stable. Together, these results suggest that epistemic familiarity is a robust signal of belief stability under semantic reframing across both representational and behavioral instantiations, complementing accuracy-based factuality evaluation with a notion of epistemic robustness.
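To make the behavioral side of the evaluation concrete, the following is a minimal sketch (not the paper's released code) of how belief retraction under a semantic perturbation could be measured with zero-shot prompting. The function names (`query_model`, `truth_judgment`, `retraction_rate`) and the prompt wording are illustrative assumptions, not P-StaT's actual implementation.

```python
# Hypothetical sketch: behavioral belief-retraction rate under a
# perturbing "Neither" statement, measured via zero-shot prompting.

def query_model(prompt: str) -> str:
    """Stand-in for any LLM chat/completion call returning 'true' or 'false'."""
    raise NotImplementedError  # plug in a real model API here

def truth_judgment(statement: str, context: str = "") -> bool:
    """Zero-shot true/false judgment, optionally preceded by a perturbing context."""
    prefix = f"{context}\n" if context else ""
    prompt = (
        f"{prefix}Is the following statement true or false?\n"
        f"Statement: {statement}\n"
        f"Answer with one word."
    )
    return query_model(prompt).strip().lower().startswith("true")

def retraction_rate(facts: list[str], perturbation: str) -> float:
    """Fraction of facts judged true in isolation but judged false
    once the perturbing statement is placed in the prompt context."""
    believed = [f for f in facts if truth_judgment(f)]
    if not believed:
        return 0.0
    retracted = [f for f in believed if not truth_judgment(f, context=perturbation)]
    return len(retracted) / len(believed)
```

Under this reading, a Fictional versus Synthetic comparison would amount to computing `retraction_rate` over the same factual statements with the two kinds of Neither statements supplied as the perturbation, and comparing how often previously affirmed facts flip.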