
Defending against Indirect Prompt Injection by Instruction Detection

Abstract

The integration of Large Language Models (LLMs) with external sources is becoming increasingly common, with Retrieval-Augmented Generation (RAG) being a prominent example. However, this integration makes systems vulnerable to Indirect Prompt Injection (IPI) attacks, where hidden instructions embedded in external data can manipulate LLMs into executing unintended or harmful actions. We recognize that the success of IPI attacks fundamentally relies on the presence of instructions embedded within external content, which can alter the behavioral state of LLMs. Can we defend against IPI attacks by effectively detecting such state changes? In this paper, we propose a novel approach that takes external data as input and leverages the behavioral state of LLMs during both forward and backward propagation to detect potential IPI attacks. Specifically, we demonstrate that the hidden states and gradients from intermediate layers provide highly discriminative features for instruction detection. By effectively combining these features, our approach achieves a detection accuracy of 99.60% in the in-domain setting and 96.90% in the out-of-domain setting, while reducing the attack success rate to just 0.12% on the BIPIA benchmark.
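
To make the detection idea concrete, below is a minimal sketch (not the authors' released code) of how one might probe an LLM's forward and backward behavioral state on a piece of external content: the text is run through a causal LM, a mean-pooled hidden state is taken from an intermediate layer, a gradient signal is obtained by backpropagating a stand-in next-token loss to that same layer, and the concatenated features would feed a lightweight classifier. The model name, layer index, pooling scheme, loss, and classifier are all illustrative assumptions, not details from the paper.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"   # assumed backbone; any causal LM works
LAYER = 16                                # assumed intermediate layer index

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def extract_features(external_text: str) -> torch.Tensor:
    """Combine a forward (hidden-state) and backward (gradient) feature."""
    inputs = tokenizer(external_text, return_tensors="pt")
    out = model(**inputs, output_hidden_states=True)
    hs = out.hidden_states[LAYER]                      # (1, seq_len, d_model)
    # Forward feature: mean-pooled hidden state of the intermediate layer.
    hidden_feat = hs.mean(dim=1)
    # Backward feature: gradient of a stand-in next-token loss with respect
    # to the same intermediate layer (the paper's exact loss is not specified
    # in the abstract, so this is an assumption).
    logits = out.logits[:, :-1]
    targets = inputs["input_ids"][:, 1:]
    loss = torch.nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    (grad,) = torch.autograd.grad(loss, hs)
    grad_feat = grad.mean(dim=1)
    return torch.cat([hidden_feat, grad_feat], dim=-1).detach()

# A simple classifier trained on features from clean vs. instruction-injected
# text would then flag suspicious external content, e.g.:
#   clf = torch.nn.Linear(2 * model.config.hidden_size, 2)
#   is_injected = clf(extract_features(doc)).argmax().item() == 1

In this framing, the forward feature captures how the content shifts the model's internal representations, while the backward feature captures how strongly the content pulls on those representations under a training signal; per the abstract, combining both kinds of features is what yields the strongest discriminative power.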

@article{wen2025_2505.06311,
  title={Defending against Indirect Prompt Injection by Instruction Detection},
  author={Tongyu Wen and Chenglong Wang and Xiyuan Yang and Haoyu Tang and Yueqi Xie and Lingjuan Lyu and Zhicheng Dou and Fangzhao Wu},
  journal={arXiv preprint arXiv:2505.06311},
  year={2025}
}