Is poisoning a real threat to LLM alignment? Maybe more so than you think

Is poisoning a real threat to LLM alignment? Maybe more so than you think

17 June 2024

Pankayaraj Pathmanathan

Souradip Chakraborty

Furong Huang

Papers citing "Is poisoning a real threat to LLM alignment? Maybe more so than you think"

7 / 7 papers shown

Title
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs Jan Betley Daniel Tan Niels Warncke Anna Sztyber-Betley Xuchan Bao Martín Soto Nathan Labenz Owain Evans AAML 73 8 0 24 Feb 2025
UNIDOOR: A Universal Framework for Action-Level Backdoor Attacks in Deep Reinforcement Learning Oubo Ma L. Du Yang Dai Chunyi Zhou Qingming Li Yuwen Pu Shouling Ji 36 0 0 28 Jan 2025
AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment Pankayaraj Pathmanathan Udari Madhushani Sehwag Michael-Andrei Panaitescu-Liess Furong Huang SILM AAML 35 0 0 15 Oct 2024
Recent Advances in Attack and Defense Approaches of Large Language Models Jing Cui Yishi Xu Zhewei Huang Shuchang Zhou Jianbin Jiao Junge Zhang PILM AAML 45 1 0 05 Sep 2024
LESS: Selecting Influential Data for Targeted Instruction Tuning Mengzhou Xia Sadhika Malladi Suchin Gururangan Sanjeev Arora Danqi Chen 68 180 0 06 Feb 2024
Poisoning Language Models During Instruction Tuning Alexander Wan Eric Wallace Sheng Shen Dan Klein SILM 90 124 0 01 May 2023
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022