
Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems

Main: 9 pages
Bibliography: 4 pages
1 figure
7 tables
Abstract

This paper investigates defenses for LLM-based evaluation systems against prompt injection. We formalize a class of threats called blind attacks, in which a candidate answer is crafted independently of the true answer to deceive the evaluator. To counter such attacks, we propose a framework that augments Standard Evaluation (SE) with Counterfactual Evaluation (CFE), which re-evaluates the submission against a deliberately false ground-truth answer. An attack is flagged when the evaluator accepts the submission under both the standard and counterfactual conditions. Experiments show that while standard evaluation is highly vulnerable, the combined SE+CFE framework substantially improves attack detection with minimal loss of evaluation performance.
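The abstract's detection rule can be illustrated with a minimal sketch: run the judge once against the true reference (SE) and once against a deliberately false reference (CFE), and flag submissions accepted in both runs. The `judge` and `counterfactual_reference` functions below are hypothetical placeholders, not the paper's actual implementation.

```python
def judge(question: str, reference: str, candidate: str) -> bool:
    """Hypothetical LLM judge: returns True if the candidate answer is
    accepted as correct given the reference answer. Replace with a real
    LLM-based evaluator call."""
    raise NotImplementedError("plug in an LLM-based evaluator here")


def counterfactual_reference(reference: str) -> str:
    """Hypothetical helper that produces a deliberately false ground-truth
    answer, e.g. by perturbing or replacing the true reference."""
    raise NotImplementedError("plug in a counterfactual-reference generator here")


def evaluate_with_cfe(question: str, reference: str, candidate: str) -> dict:
    # Standard Evaluation: judge against the true reference.
    se_pass = judge(question, reference, candidate)
    # Counterfactual Evaluation: judge against a deliberately false reference.
    cfe_pass = judge(question, counterfactual_reference(reference), candidate)

    # A genuinely correct answer should pass SE but fail CFE, since the
    # counterfactual reference is false. Passing both suggests the candidate
    # manipulates the judge rather than matching the ground truth, so it is
    # flagged as a blind attack.
    return {
        "attack_detected": se_pass and cfe_pass,
        "accepted": se_pass and not cfe_pass,
    }
```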
