2404.14461
Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs
22 April 2024
Javier Rando
Francesco Croce
Kryštof Mitka
Stepan Shabalin
Maksym Andriushchenko
Nicolas Flammarion
F. Tramèr
Papers citing
"Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs"
14 / 14 papers shown
Title
Towards Autonomous Reinforcement Learning for Real-World Robotic Manipulation with Large Language Models
Niccolò Turcato
Matteo Iovino
Aris Synodinos
Alberto Dalla Libera
R. Carli
Pietro Falco
LM&Ro
43
0
0
06 Mar 2025
Neutralizing Backdoors through Information Conflicts for Large Language Models
Chen Chen
Yuchen Sun
Xueluan Gong
Jiaxin Gao
K. Lam
KELM
AAML
69
0
0
27 Nov 2024
Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI
Hangzhi Guo
Pranav Narayanan Venkit
Eunchae Jang
Mukund Srinath
Wenbo Zhang
Bonam Mingole
Vipul Gupta
Kush R. Varshney
S. Shyam Sundar
A. Yadav
46
3
0
20 Oct 2024
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment
Pankayaraj Pathmanathan
Udari Madhushani Sehwag
Michael-Andrei Panaitescu-Liess
Furong Huang
SILM
AAML
38
0
0
15 Oct 2024
Uncovering, Explaining, and Mitigating the Superficial Safety of Backdoor Defense
Rui Min
Zeyu Qin
Nevin L. Zhang
Li Shen
Minhao Cheng
AAML
31
4
0
13 Oct 2024
BackdoorLLM: A Comprehensive Benchmark for Backdoor Attacks on Large Language Models
Yige Li
Hanxun Huang
Yunhan Zhao
Xingjun Ma
Jun Sun
AAML
SILM
38
19
0
23 Aug 2024
BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Yi Zeng
Weiyu Sun
Tran Ngoc Huynh
Dawn Song
Bo Li
Ruoxi Jia
AAML
LLMSV
35
17
0
24 Jun 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Edoardo Debenedetti
Javier Rando
Daniel Paleka
Silaghi Fineas Florin
Dragos Albastroiu
...
Stefan Kraft
Mario Fritz
Florian Tramèr
Sahar Abdelnabi
Lea Schönherr
46
9
0
12 Jun 2024
Unelicitable Backdoors in Language Models via Cryptographic Transformer Circuits
Andis Draguns
Andrew Gritsevskiy
S. Motwani
Charlie Rogers-Smith
Jeffrey Ladish
Christian Schroeder de Witt
40
2
0
03 Jun 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
81
158
0
02 Apr 2024
Rethinking Machine Unlearning for Large Language Models
Sijia Liu
Yuanshun Yao
Jinghan Jia
Stephen Casper
Nathalie Baracaldo
...
Hang Li
Kush R. Varshney
Mohit Bansal
Sanmi Koyejo
Yang Liu
AILaw
MU
65
81
0
13 Feb 2024
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Boyi Wei
Kaixuan Huang
Yangsibo Huang
Tinghao Xie
Xiangyu Qi
Mengzhou Xia
Prateek Mittal
Mengdi Wang
Peter Henderson
AAML
55
79
0
07 Feb 2024
Poisoning Language Models During Instruction Tuning
Alexander Wan
Eric Wallace
Sheng Shen
Dan Klein
SILM
90
124
0
01 May 2023
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang
Alexandre Variengien
Arthur Conmy
Buck Shlegeris
Jacob Steinhardt
210
494
0
01 Nov 2022