Universal and Transferable Adversarial Attacks on Aligned Language Models

27 July 2023

J. Zico Kolter

Papers citing "Universal and Transferable Adversarial Attacks on Aligned Language Models"

38 / 938 papers shown

Title
Certifying LLM Safety against Adversarial Prompting Aounon Kumar Chirag Agarwal Suraj Srinivas Aaron Jiaxun Li S. Feizi Himabindu Lakkaraju AAML 22 161 0 06 Sep 2023
Demystifying RCE Vulnerabilities in LLM-Integrated Apps Tong Liu Zizhuang Deng Guozhu Meng Yuekang Li Kai Chen SILM 34 19 0 06 Sep 2023
Open Sesame! Universal Black Box Jailbreaking of Large Language Models Raz Lapid Ron Langberg Moshe Sipper AAML 11 103 0 04 Sep 2023
Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models Yue Zhang Yafu Li Leyang Cui Deng Cai Lemao Liu ... Longyue Wang A. Luu Wei Bi Freda Shi Shuming Shi RALM LRM HILM 41 518 0 03 Sep 2023
Robust and Efficient Interference Neural Networks for Defending Against Adversarial Attacks in ImageNet Yunuo Xiong Shujuan Liu H. Xiong AAML 19 0 0 03 Sep 2023
Baseline Defenses for Adversarial Attacks Against Aligned Language Models Neel Jain Avi Schwarzschild Yuxin Wen Gowthami Somepalli John Kirchenbauer Ping-yeh Chiang Micah Goldblum Aniruddha Saha Jonas Geiping Tom Goldstein AAML 31 335 0 01 Sep 2023
Why do universal adversarial attacks work on large language models?: Geometry might be the answer Varshini Subhash Anna Bialas Weiwei Pan Finale Doshi-Velez AAML 14 10 0 01 Sep 2023
Image Hijacks: Adversarial Images can Control Generative Models at Runtime Luke Bailey Euan Ong Stuart J. Russell Scott Emmons VLM MLLM 16 78 0 01 Sep 2023
Large language models in medicine: the potentials and pitfalls J. Omiye Haiwen Gui Shawheen J. Rezaei James Zou Roxana Daneshjou LM&MA 16 64 0 31 Aug 2023
A Classification-Guided Approach for Adversarial Attacks against Neural Machine Translation Sahar Sadrizadeh Ljiljana Dolamic P. Frossard AAML SILM 21 2 0 29 Aug 2023
Identifying and Mitigating the Security Risks of Generative AI Clark W. Barrett Bradley L Boyd Elie Bursztein Nicholas Carlini Brad Chen ... Zulfikar Ramzan Khawaja Shams D. Song Ankur Taly Diyi Yang SILM 24 89 0 28 Aug 2023
AI Deception: A Survey of Examples, Risks, and Potential Solutions Peter S. Park Simon Goldstein Aidan O'Gara Michael Chen Dan Hendrycks 25 137 0 28 Aug 2023
Detecting Language Model Attacks with Perplexity Gabriel Alon Michael Kamfonas AAML 40 176 0 27 Aug 2023
Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities Maximilian Mozes Xuanli He Bennett Kleinberg Lewis D. Griffin 31 76 0 24 Aug 2023
Adversarial Illusions in Multi-Modal Embeddings Tingwei Zhang Rishi Jha Eugene Bagdasaryan Vitaly Shmatikov AAML 24 8 0 22 Aug 2023
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions Pouya Pezeshkpour Estevam R. Hruschka LRM 8 124 0 22 Aug 2023
Enhancing Adversarial Attacks: The Similar Target Method Shuo Zhang Ziruo Wang Zikai Zhou Huanran Chen AAML 44 1 0 21 Aug 2023
On the Adversarial Robustness of Multi-Modal Foundation Models Christian Schlarmann Matthias Hein AAML 105 84 0 21 Aug 2023
DUAW: Data-free Universal Adversarial Watermark against Stable Diffusion Customization Xiaoyu Ye Hao Huang Jiaqi An Yongtao Wang WIGM 24 22 0 19 Aug 2023
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment Rishabh Bhardwaj Soujanya Poria ELM 17 127 0 18 Aug 2023
Position: Key Claims in LLM Research Have a Long Tail of Footnotes Anna Rogers A. Luccioni 40 19 0 14 Aug 2023
GPT-4 Is Too Smart To Be Safe: Stealthy Chat with LLMs via Cipher Youliang Yuan Wenxiang Jiao Wenxuan Wang Jen-tse Huang Pinjia He Shuming Shi Zhaopeng Tu SILM 61 231 0 12 Aug 2023
CLEVA: Chinese Language Models EVAluation Platform Yanyang Li Jianqiao Zhao Duo Zheng Zi-Yuan Hu Zhi Chen ... Yongfeng Huang Shijia Huang Dahua Lin Michael R. Lyu Liwei Wang ALM ELM 33 9 0 09 Aug 2023
"Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models Xinyue Shen Z. Chen Michael Backes Yun Shen Yang Zhang SILM 33 243 0 07 Aug 2023
Why We Don't Have AGI Yet Peter Voss M. Jovanovic VLM 11 2 0 07 Aug 2023
XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models Paul Röttger Hannah Rose Kirk Bertie Vidgen Giuseppe Attanasio Federico Bianchi Dirk Hovy ALM ELM AILaw 21 122 0 02 Aug 2023
Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc D. Lenat G. Marcus 9 29 0 31 Jul 2023
Jailbreak in pieces: Compositional Adversarial Attacks on Multi-Modal Language Models Erfan Shayegani Yue Dong Nael B. Abu-Ghazaleh 20 126 0 26 Jul 2023
Effective Prompt Extraction from Language Models Yiming Zhang Nicholas Carlini Daphne Ippolito MIACV SILM 25 35 0 13 Jul 2023
Visual Adversarial Examples Jailbreak Aligned Large Language Models Xiangyu Qi Kaixuan Huang Ashwinee Panda Peter Henderson Mengdi Wang Prateek Mittal AAML 23 136 0 22 Jun 2023
Explore, Establish, Exploit: Red Teaming Language Models from Scratch Stephen Casper Jason Lin Joe Kwon Gatlen Culp Dylan Hadfield-Menell AAML 8 83 0 15 Jun 2023
Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks Abhinav Rao S. Vashistha Atharva Naik Somak Aditya Monojit Choudhury 25 17 0 24 May 2023
Improving alignment of dialogue agents via targeted human judgements Amelia Glaese Nat McAleese Maja Trębacz John Aslanides Vlad Firoiu ... John F. J. Mellor Demis Hassabis Koray Kavukcuoglu Lisa Anne Hendricks G. Irving ALM AAML 225 500 0 28 Sep 2022
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 303 11,881 0 04 Mar 2022
The Power of Scale for Parameter-Efficient Prompt Tuning Brian Lester Rami Al-Rfou Noah Constant VPVLM 280 3,835 0 18 Apr 2021
Gradient-based Adversarial Attacks against Text Transformers Chuan Guo Alexandre Sablayrolles Hervé Jégou Douwe Kiela SILM 98 227 0 15 Apr 2021
Globally-Robust Neural Networks Klas Leino Zifan Wang Matt Fredrikson AAML OOD 80 125 0 16 Feb 2021
Generating Natural Language Adversarial Examples M. Alzantot Yash Sharma Ahmed Elgohary Bo-Jhang Ho Mani B. Srivastava Kai-Wei Chang AAML 243 914 0 21 Apr 2018