Title
Attention Tracker: Detecting Prompt Injection Attacks in LLMs Kuo-Han Hung Ching-Yun Ko Ambrish Rawat I-Hsin Chung Winston H. Hsu Pin-Yu Chen 46 7 0 01 Nov 2024
Hey GPT, Can You be More Racist? Analysis from Crowdsourced Attempts to Elicit Biased Content from Generative AI Hangzhi Guo Pranav Narayanan Venkit Eunchae Jang Mukund Srinath Wenbo Zhang Bonam Mingole Vipul Gupta Kush R. Varshney S. Shyam Sundar A. Yadav 27 3 0 20 Oct 2024
Applying Refusal-Vector Ablation to Llama 3.1 70B Agents Simon Lermen Mateusz Dziemian Govind Pimpale LLMAG 15 4 0 08 Oct 2024
Position: LLM Unlearning Benchmarks are Weak Measures of Progress Pratiksha Thaker Shengyuan Hu Neil Kale Yash Maurya Zhiwei Steven Wu Virginia Smith MU 39 10 0 03 Oct 2024
Prompt Obfuscation for Large Language Models David Pape Thorsten Eisenhofer Thorsten Eisenhofer Lea Schönherr AAML 31 2 0 17 Sep 2024
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents Edoardo Debenedetti Jie Zhang Mislav Balunović Luca Beurer-Kellner Marc Fischer Florian Tramèr LLMAG AAML 43 25 1 19 Jun 2024
Are you still on track!? Catching LLM Task Drift with Activations Sahar Abdelnabi Aideen Fay Giovanni Cherubin Ahmed Salem Mario Fritz Andrew J. Paverd 55 7 0 02 Jun 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions Eric Wallace Kai Y. Xiao R. Leike Lilian Weng Johannes Heidecke Alex Beutel SILM 47 113 0 19 Apr 2024
Whispers in the Machine: Confidentiality in LLM-integrated Systems Jonathan Evertz Merlin Chlosta Lea Schonherr Thorsten Eisenhofer 69 15 0 10 Feb 2024