Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks

14 February 2024
Yixin Cheng, Markos Georgopoulos, V. Cevher, Grigorios G. Chrysos
AAML

Papers citing "Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks" (15 of 15 shown)
REVEAL: Multi-turn Evaluation of Image-Input Harms for Vision LLM
  Madhur Jindal, Saurabh Deshpande · AAML · 0 citations · 07 May 2025
Bayesian Optimization of Robustness Measures Using Randomized GP-UCB-based Algorithms under Input Uncertainty
  Yu Inatsu · 0 citations · 04 Apr 2025
On the Robustness of Transformers against Context Hijacking for Linear Classification
  Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou · 0 citations · 24 Feb 2025
Preventing Jailbreak Prompts as Malicious Tools for Cybercriminals: A Cyber Defense Perspective
  Jean Marie Tshimula, Xavier Ndona, D'Jeff K. Nkashama, Pierre Martin Tardif, F. Kabanza, Marc Frappier, Shengrui Wang · SILM · 0 citations · 25 Nov 2024
Diversity Helps Jailbreak Large Language Models
  Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao · AAML · 0 citations · 06 Nov 2024
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
  Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, ..., Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang · LLMAG · 17 citations · 11 Oct 2024
You Know What I'm Saying: Jailbreak Attack via Implicit Reference
  Tianyu Wu, Lingrui Mei, Ruibin Yuan, Lujun Li, Wei Xue, Yike Guo · 1 citation · 04 Oct 2024
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens
  Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou · AAML · 8 citations · 06 Jun 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
  Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion · AAML · 155 citations · 02 Apr 2024
Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack
  M. Russinovich, Ahmed Salem, Ronen Eldan · 75 citations · 02 Apr 2024
How (un)ethical are instruction-centric responses of LLMs? Unveiling the vulnerabilities of safety guardrails to harmful queries
  Somnath Banerjee, Sayan Layek, Rima Hazra, Animesh Mukherjee · 10 citations · 23 Feb 2024
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
  Jiahao Yu, Xingwei Lin, Zheng Yu, Xinyu Xing · SILM · 292 citations · 19 Sep 2023
Assessing the impact of contextual information in hate speech detection
  Juan Manuel Pérez, Franco Luque, Demián Zayat, Martín Kondratzky, Agustín Moro, ..., Joaquín Zajac, Paula Miguel, Natalia Debandi, Agustin Gravano, Viviana Cotik · 29 citations · 02 Oct 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
  Deep Ganguli, Liane Lovitt, John Kernion, Amanda Askell, Yuntao Bai, ..., Nicholas Joseph, Sam McCandlish, C. Olah, Jared Kaplan, Jack Clark · 441 citations · 23 Aug 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
  Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou · LM&Ro, LRM, AI4CE, ReLM · 8,261 citations · 28 Jan 2022