Peering Behind the Shield: Guardrail Identification in Large Language Models

Abstract

Human-AI conversations have attracted increasing attention in the era of large language models. Consequently, techniques such as input/output guardrails and safety alignment have been proposed to prevent potential misuse of these conversations. However, the ability to identify which guardrails are deployed has significant implications, both for adversarial exploitation and for auditing by red-team operators. In this work, we propose AP-Test, a novel method that identifies the presence of a candidate guardrail by querying the AI agent with guardrail-specific adversarial prompts. Extensive experiments on four candidate guardrails under diverse scenarios demonstrate the effectiveness of our method. An ablation study further illustrates the importance of the components we designed, such as the loss terms.
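The core idea, as the abstract describes it, is to query a target agent with adversarial prompts crafted against one candidate guardrail and infer the guardrail's presence from the responses. The sketch below illustrates that decision procedure only; every name in it (`query_agent`, `REFUSAL_MARKERS`, the threshold, the refusal heuristic) is an illustrative assumption, not the paper's actual AP-Test implementation.

```python
# Hypothetical sketch of guardrail-presence testing in the spirit of AP-Test.
# The refusal heuristic and threshold are assumptions for illustration.

REFUSAL_MARKERS = ("i cannot", "i can't", "sorry")


def looks_refused(response: str) -> bool:
    """Crude proxy: treat canned-refusal phrasing as a guardrail trigger."""
    lower = response.lower()
    return any(marker in lower for marker in REFUSAL_MARKERS)


def guardrail_present(adversarial_prompts, query_agent, threshold=0.5):
    """Query the agent with prompts crafted against one candidate guardrail;
    flag the guardrail as present if the trigger rate exceeds the threshold."""
    triggered = sum(looks_refused(query_agent(p)) for p in adversarial_prompts)
    return triggered / len(adversarial_prompts) > threshold


# Toy stand-in for an agent protected by the candidate guardrail.
def mock_agent(prompt: str) -> str:
    return "Sorry, I cannot help with that." if "attack" in prompt else "Sure!"


prompts = ["attack plan A", "attack plan B", "a benign question"]
print(guardrail_present(prompts, mock_agent))  # True: 2/3 prompts triggered
```

In practice, the paper's guardrail-specific prompts would be optimized (via the loss terms mentioned in the abstract) so that agents with and without the candidate guardrail respond distinguishably; the simple marker-matching above merely stands in for that response classifier.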

@article{yang2025_2502.01241,
  title={Peering Behind the Shield: Guardrail Identification in Large Language Models},
  author={Ziqing Yang and Yixin Wu and Rui Wen and Michael Backes and Yang Zhang},
  journal={arXiv preprint arXiv:2502.01241},
  year={2025}
}