All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks

18 January 2024

Papers citing "All in How You Ask for It: Simple Black-Box Method for Jailbreak Attacks"

7 / 7 papers shown

Title
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models Zhi-Yi Chin Kuan-Chen Mu Mario Fritz Pin-Yu Chen DiffM 83 0 0 25 Nov 2024
Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring Honglin Mu Han He Yuxin Zhou Yunlong Feng Yang Xu ... Zeming Liu Xudong Han Qi Shi Qingfu Zhu Wanxiang Che AAML 28 1 0 28 Oct 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks Maksym Andriushchenko Francesco Croce Nicolas Flammarion AAML 79 155 0 02 Apr 2024
Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks Erfan Shayegani Md Abdullah Al Mamun Yu Fu Pedram Zaree Yue Dong Nael B. Abu-Ghazaleh AAML 138 139 0 16 Oct 2023
A Review of ChatGPT Applications in Education, Marketing, Software Engineering, and Healthcare: Benefits, Drawbacks, and Research Directions Mohammad Fraiwan Natheer Khasawneh 38 35 0 29 Apr 2023
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models Jason W. Wei Xuezhi Wang Dale Schuurmans Maarten Bosma Brian Ichter F. Xia Ed H. Chi Quoc Le Denny Zhou LM&Ro LRM AI4CE ReLM 315 8,261 0 28 Jan 2022