
Layer of Truth: Probing Belief Shifts under Continual Pre-Training Poisoning

Main: 24 Pages
13 Figures
Bibliography: 3 Pages
13 Tables
Abstract

Large language models (LLMs) continually evolve through pre-training on ever-expanding web data, but this adaptive process also exposes them to subtle forms of misinformation. While prior work has explored data poisoning during static pre-training, the effects of such manipulations under continual pre-training remain largely unexplored. Drawing inspiration from the illusory truth effect in human cognition, where repeated exposure to falsehoods increases belief in their accuracy, we ask whether LLMs exhibit a similar vulnerability. We investigate whether repeated exposure to false but confidently stated facts can shift a model's internal representation away from the truth.
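As a rough illustration of what "probing belief shifts" can mean in practice, the sketch below trains a simple linear probe on per-layer hidden states to separate true from false statements; tracking that probe's accuracy before and after continual pre-training on poisoned data would reveal whether the model's internal representation drifts. The model name, example statements, and probed layer are illustrative assumptions, not the paper's actual setup.

```python
# Hypothetical layer-wise "truth probe": fit a linear classifier on the
# last-token hidden states of true vs. false statements at a chosen layer.
# Re-running this after each continual pre-training (poisoning) step would
# show whether the truth/falsehood separation degrades.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

MODEL_NAME = "gpt2"  # placeholder checkpoint; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()

def hidden_state(text: str, layer: int) -> torch.Tensor:
    """Return the last-token hidden state at the given layer."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Toy labeled statements (1 = true, 0 = false); a real probe needs far more data.
statements = [
    ("The Eiffel Tower is in Paris.", 1),
    ("The Eiffel Tower is in Rome.", 0),
    ("Water boils at 100 degrees Celsius at sea level.", 1),
    ("Water boils at 40 degrees Celsius at sea level.", 0),
]

layer = 6  # single layer for illustration; in practice sweep over all layers
X = torch.stack([hidden_state(s, layer) for s, _ in statements]).numpy()
y = [label for _, label in statements]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print(f"Layer {layer} probe accuracy: {probe.score(X, y):.2f}")
```

In a poisoning study, the probe would be fit on a held-out set of factual statements and evaluated at successive continual pre-training checkpoints, so that a drop in probe accuracy (or a flip in the probe's prediction for the targeted facts) signals a belief shift at that layer.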
