Injecting Universal Jailbreak Backdoors into LLMs in Minutes. International Conference on Learning Representations (ICLR), 2025.
Safety Alignment Should Be Made More Than Just a Few Tokens Deep. International Conference on Learning Representations (ICLR), 2025.
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks. International Conference on Learning Representations (ICLR), 2025.
Shortcuts Everywhere and Nowhere: Exploring Multi-Trigger Backdoor Attacks. IEEE Transactions on Dependable and Secure Computing (IEEE TDSC), 2024.
BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models. International Conference on Learning Representations (ICLR), 2024.
Universal Jailbreak Backdoors from Poisoned Human Feedback. International Conference on Learning Representations (ICLR), 2024.
Backdooring Instruction-Tuned Large Language Models with Virtual Prompt Injection. North American Chapter of the Association for Computational Linguistics (NAACL), 2024.
BackdoorBench: A Comprehensive Benchmark of Backdoor Learning. Neural Information Processing Systems (NeurIPS), 2022.
Language Models are Few-Shot Learners. Neural Information Processing Systems (NeurIPS), 2020.
Weight Poisoning Attacks on Pre-trained Models. Annual Meeting of the Association for Computational Linguistics (ACL), 2020.