Coercing LLMs to do and reveal (almost) anything

21 February 2024

Papers citing "Coercing LLMs to do and reveal (almost) anything"

The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them) Zihao Wang Yibo Jiang Jiahao Yu Heqing Huang 33 0 0 01 May 2025
Augmented Adversarial Trigger Learning Zhe Wang Yanjun Qi 46 0 0 16 Mar 2025
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search Andy Zhou MU 67 0 0 13 Mar 2025
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models Alberto Purpura Sahil Wadhwa Jesse Zymet Akshay Gupta Andy Luo Melissa Kazemi Rad Swapnil Shinde Mohammad Sorower AAML 71 0 0 03 Mar 2025
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference Roman Levin Valeriia Cherepanova Abhimanyu Hans Avi Schwarzschild Tom Goldstein 59 1 0 14 Feb 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks Hieu Minh "Jord" Nguyen LM&MA LRM 49 0 0 10 Feb 2025
OverThink: Slowdown Attacks on Reasoning LLMs A. Kumar Jaechul Roh A. Naseh Marzena Karpinska Mohit Iyyer Amir Houmansadr Eugene Bagdasarian LRM 57 12 0 04 Feb 2025
Lessons From Red Teaming 100 Generative AI Products Blake Bullwinkel Amanda Minnich Shiven Chawla Gary Lopez Martin Pouliot ... Pete Bryan Ram Shankar Siva Kumar Yonatan Zunger Chang Kawaguchi Mark Russinovich AAML VLM 37 4 0 13 Jan 2025
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In Itay Nakash George Kour Guy Uziel Ateret Anaby-Tavor AAML LLMAG 21 4 0 22 Oct 2024
Bayesian scaling laws for in-context learning Aryaman Arora Dan Jurafsky Christopher Potts Noah D. Goodman 24 2 0 21 Oct 2024
GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization Zihui Wu Haichang Gao Ping Wang Shudong Zhang Zhaoxiang Liu Shiguo Lian 21 0 0 19 Oct 2024
SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization Akrit Mudvari Yuang Jiang Leandros Tassiulas 25 0 0 14 Oct 2024
Towards Assurance of LLM Adversarial Robustness using Ontology-Driven Argumentation Tomas Bueno Momcilovic Beat Buesser Giulio Zizzo Mark Purcell AAML 20 2 0 10 Oct 2024
Towards Assuring EU AI Act Compliance and Adversarial Robustness of LLMs Tomas Bueno Momcilovic Beat Buesser Giulio Zizzo Mark Purcell Dian Balta AAML 25 2 0 04 Oct 2024
Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial Robustness of LLMs Tomas Bueno Momcilovic Dian Balta Beat Buesser Giulio Zizzo Mark Purcell AAML 19 0 0 04 Oct 2024
Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs Tomas Bueno Momcilovic Dian Balta Beat Buesser Giulio Zizzo Mark Purcell AAML 16 0 0 04 Oct 2024
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester Maya Pavlova Erik Brinkman Krithika Iyer Vítor Albiero Joanna Bitton Hailey Nguyen J. Li Cristian Canton Ferrer Ivan Evtimov Aaron Grattafiori ALM 26 6 0 02 Oct 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training Pavel Chizhov Catherine Arnett Elizaveta Korotkova Ivan P. Yamshchikov 37 2 0 06 Sep 2024
Compromising Embodied Agents with Contextual Backdoor Attacks Aishan Liu Yuguang Zhou Xianglong Liu Tianyuan Zhang Siyuan Liang ... Tianlin Li Junqi Zhang Wenbo Zhou Qing-Wu Guo Dacheng Tao LLMAG AAML 29 1 0 06 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs Jingtong Su Mingyu Lee SangKeun Lee 30 7 0 02 Aug 2024
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data? J. Hayase Alisa Liu Yejin Choi Sewoong Oh Noah A. Smith 27 9 0 23 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) Apurv Verma Satyapriya Krishna Sebastian Gehrmann Madhavan Seshadri Anu Pradhan Tom Ault Leslie Barrett David Rabinowitz John Doucette Nhathai Phan 47 8 0 20 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey Sibo Yi Yule Liu Zhen Sun Tianshuo Cong Xinlei He Jiaxing Song Ke Xu Qi Li AAML 34 77 0 05 Jul 2024
Single Character Perturbations Break LLM Alignment Leon Lin Hannah Brown Kenji Kawaguchi Michael Shieh AAML 35 2 0 03 Jul 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models Haibo Jin Leyang Hu Xinuo Li Peiyan Zhang Chonghan Chen Jun Zhuang Haohan Wang PILM 36 26 0 26 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models Asa Cooper Stickland Alexander Lyzhov Jacob Pfau Salsabila Mahdi Samuel R. Bowman LLMSV AAML 52 17 0 21 Jun 2024
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents Edoardo Debenedetti Jie Zhang Mislav Balunović Luca Beurer-Kellner Marc Fischer Florian Tramèr LLMAG AAML 43 25 1 19 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models Delong Ran Jinyuan Liu Yichen Gong Jingyi Zheng Xinlei He Tianshuo Cong Anyu Wang ELM 42 10 0 13 Jun 2024
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways Zehang Deng Yongjian Guo Changzhou Han Wanlun Ma Junwu Xiong Sheng Wen Yang Xiang 42 19 0 04 Jun 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models Sander Land Max Bartolo 21 20 0 08 May 2024
Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships D. Panas S. Seth V. Belle ReLM LRM 18 2 0 30 Apr 2024
Rethinking LLM Memorization through the Lens of Adversarial Compression Avi Schwarzschild Zhili Feng Pratyush Maini Zachary Chase Lipton J. Zico Kolter 39 38 0 23 Apr 2024
Pixels and Predictions: Potential of GPT-4V in Meteorological Imagery Analysis and Forecast Communication John R. Lawson Montgomery Flora Kevin H. Goebbert Seth N. Lyman Corey K. Potvin David M. Schultz Adam J. Stepanek Joseph E. Trujillo-Falcón MLLM 34 1 0 22 Apr 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions Eric Wallace Kai Y. Xiao R. Leike Lilian Weng Johannes Heidecke Alex Beutel SILM 47 113 0 19 Apr 2024
Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra Darioush Kevian U. Syed Xing-ming Guo Aaron J. Havens Geir Dullerud Peter M. Seiler Lianhui Qin Bin Hu ELM 31 29 0 04 Apr 2024
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs Aly M. Kassem Omar Mahmoud Niloofar Mireshghallah Hyunwoo J. Kim Yulia Tsvetkov Yejin Choi Sherif Saad Santu Rana 47 18 0 05 Mar 2024
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning Adib Hasan Ileana Rugina Alex Wang AAML 47 22 0 19 Jan 2024
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts Jiahao Yu Xingwei Lin Zheng Yu Xinyu Xing SILM 110 292 0 19 Sep 2023
Training language models to follow instructions with human feedback Long Ouyang Jeff Wu Xu Jiang Diogo Almeida Carroll L. Wainwright ... Amanda Askell Peter Welinder Paul Christiano Jan Leike Ryan J. Lowe OSLM ALM 301 11,730 0 04 Mar 2022
Gradient-based Adversarial Attacks against Text Transformers Chuan Guo Alexandre Sablayrolles Hervé Jégou Douwe Kiela SILM 98 225 0 15 Apr 2021