Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning
- KELMHILMCLL
24 pages (main), 3-page bibliography, 13 figures, 13 tables
Abstract
Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition, where repeated exposure to falsehoods increases belief in their accuracy, we ask whether LLMs exhibit a similar vulnerability. Specifically, we investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representations away from the truth.
