
Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Main: 24 Pages
13 Figures
Bibliography: 3 Pages
13 Tables
Abstract

Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition, where repeated exposure to falsehoods increases belief in their accuracy, we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth.
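As a rough illustration of what "probing belief shifts" can mean in practice, the sketch below trains a simple linear probe on per-layer hidden states to separate true from false statements; tracking that probe's accuracy before and after continual pre-training on poisoned data would reveal whether the model's internal representation drifts. The model name, example statements, and probed layer are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical layer-wise "truth probe": fit a linear classifier on the
# last-token hidden states of true vs. false statements at a chosen layer.
# Re-running this after each continual pre-training (poisoning) step would
# show whether the truth/falsehood separation degrades.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_state(text: str, layer: int) -> torch.Tensor:
    """Return the last-token hidden state at the given layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Toy labeled statements (1 = true, 0 = false); a real probe needs far more data.
statements = [
    ("The Eiffel Tower is in Paris.", 1),
    ("The Eiffel Tower is in Rome.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 40 degrees Celsius at sea level.", 0),
]

layer = 6  # single layer for illustration; in practice sweep over all layers
X = torch.stack([hidden_state(s, layer) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Layer {layer} probe accuracy: {probe.score(X, y):.2f}")
```

In a poisoning study, the probe would be fit on a held-out set of factual statements and evaluated at successive continual pre-training checkpoints, so that a drop in probe accuracy (or a flip in the probe's prediction for the targeted facts) signals a belief shift at that layer.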
