Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
arXiv 2405.20099 · 30 May 2024
Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho
AAML
Papers citing "Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks" (6 papers shown)
T2VShield: Model-Agnostic Jailbreak Defense for Text-to-Video Models
Siyuan Liang, Jiayang Liu, Jiecheng Zhai, Tianmeng Fang, Rongcheng Tu, A. Liu, Xiaochun Cao, Dacheng Tao
VGen · 22 Apr 2025
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
Jiahao Qiu, Yinghui He, Xinzhe Juan, Y. Wang, Y. Liu, Zixin Yao, Yue Wu, Xun Jiang, L. Yang, Mengdi Wang
AI4MH · 13 Apr 2025
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
09 Oct 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel
AAML · 08 Jun 2024
Summarization is (Almost) Dead
Xiao Pu, Mingqi Gao, Xiaojun Wan
HILM · 18 Sep 2023
Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 04 Mar 2022