Testing the Limits of Jailbreaking Defenses with the Purple Problem

20 March 2024

Papers citing "Testing the Limits of Jailbreaking Defenses with the Purple Problem"

Title | Authors | Topic tags | Metrics (as listed) | Date
FLAME: Flexible LLM-Assisted Moderation Engine | Ivan Bakulin, Ilia Kopanichuk, Iaroslav Bespalov, Nikita Radchenko, V. Shaposhnikov, Dmitry V. Dylov, Ivan Oseledets | - | 74, 0, 0 | 13 Feb 2025
Endless Jailbreaks with Bijection Learning | Brian R. Y. Huang, Maximilian Li, Leonard Tang | AAML | 51, 5, 0 | 02 Oct 2024
Attacking Large Language Models with Projected Gradient Descent | Simon Geisler, Tom Wollschläger, M. H. I. Abdalla, Johannes Gasteiger, Stephan Günnemann | AAML, SILM | 34, 48, 0 | 14 Feb 2024
On the Risk of Misinformation Pollution with Large Language Models | Yikang Pan, Liangming Pan, Wenhu Chen, Preslav Nakov, Min-Yen Kan, W. Wang | DeLMO | 188, 105, 0 | 23 May 2023
Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI | Lorena Piedras, Lucas Rosenblatt, Julia Wilkins | - | 21, 7, 0 | 05 Jan 2023
Training language models to follow instructions with human feedback | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe | OSLM, ALM | 301, 11,730, 0 | 04 Mar 2022
Gradient-based Adversarial Attacks against Text Transformers | Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, Douwe Kiela | SILM | 93, 162, 0 | 15 Apr 2021