Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.14020
Cited By
Coercing LLMs to do and reveal (almost) anything
21 February 2024
Jonas Geiping
Alex Stein
Manli Shu
Khalid Saifullah
Yuxin Wen
Tom Goldstein
AAML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Coercing LLMs to do and reveal (almost) anything"
40 / 40 papers shown
Title
The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)
Zihao Wang
Yibo Jiang
Jiahao Yu
Heqing Huang
33
0
0
01 May 2025
Augmented Adversarial Trigger Learning
Zhe Wang
Yanjun Qi
46
0
0
16 Mar 2025
Siege: Autonomous Multi-Turn Jailbreaking of Large Language Models with Tree Search
Andy Zhou
MU
67
0
0
13 Mar 2025
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models
Alberto Purpura
Sahil Wadhwa
Jesse Zymet
Akshay Gupta
Andy Luo
Melissa Kazemi Rad
Swapnil Shinde
Mohammad Sorower
AAML
71
0
0
03 Mar 2025
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference
Roman Levin
Valeriia Cherepanova
Abhimanyu Hans
Avi Schwarzschild
Tom Goldstein
59
1
0
14 Feb 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks
Hieu Minh "Jord" Nguyen
LM&MA
LRM
49
0
0
10 Feb 2025
OverThink: Slowdown Attacks on Reasoning LLMs
A. Kumar
Jaechul Roh
A. Naseh
Marzena Karpinska
Mohit Iyyer
Amir Houmansadr
Eugene Bagdasarian
LRM
57
12
0
04 Feb 2025
Lessons From Red Teaming 100 Generative AI Products
Blake Bullwinkel
Amanda Minnich
Shiven Chawla
Gary Lopez
Martin Pouliot
...
Pete Bryan
Ram Shankar Siva Kumar
Yonatan Zunger
Chang Kawaguchi
Mark Russinovich
AAML
VLM
37
4
0
13 Jan 2025
Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In
Itay Nakash
George Kour
Guy Uziel
Ateret Anaby-Tavor
AAML
LLMAG
21
4
0
22 Oct 2024
Bayesian scaling laws for in-context learning
Aryaman Arora
Dan Jurafsky
Christopher Potts
Noah D. Goodman
24
2
0
21 Oct 2024
GlitchMiner: Mining Glitch Tokens in Large Language Models via Gradient-based Discrete Optimization
Zihui Wu
Haichang Gao
Ping Wang
Shudong Zhang
Zhaoxiang Liu
Shiguo Lian
21
0
0
19 Oct 2024
SplitLLM: Collaborative Inference of LLMs for Model Placement and Throughput Optimization
Akrit Mudvari
Yuang Jiang
Leandros Tassiulas
25
0
0
14 Oct 2024
Towards Assurance of LLM Adversarial Robustness using Ontology-Driven Argumentation
Tomas Bueno Momcilovic
Beat Buesser
Giulio Zizzo
Mark Purcell
Tomas Bueno Momcilovic
AAML
20
2
0
10 Oct 2024
Towards Assuring EU AI Act Compliance and Adversarial Robustness of LLMs
Tomas Bueno Momcilovic
Beat Buesser
Giulio Zizzo
Mark Purcell
Dian Balta
AAML
25
2
0
04 Oct 2024
Knowledge-Augmented Reasoning for EUAIA Compliance and Adversarial Robustness of LLMs
Tomas Bueno Momcilovic
Dian Balta
Beat Buesser
Giulio Zizzo
Mark Purcell
AAML
19
0
0
04 Oct 2024
Developing Assurance Cases for Adversarial Robustness and Regulatory Compliance in LLMs
Tomas Bueno Momcilovic
Dian Balta
Beat Buesser
Giulio Zizzo
Mark Purcell
AAML
16
0
0
04 Oct 2024
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova
Erik Brinkman
Krithika Iyer
Vítor Albiero
Joanna Bitton
Hailey Nguyen
J. Li
Cristian Canton Ferrer
Ivan Evtimov
Aaron Grattafiori
ALM
26
6
0
02 Oct 2024
BPE Gets Picky: Efficient Vocabulary Refinement During Tokenizer Training
Pavel Chizhov
Catherine Arnett
Elizaveta Korotkova
Ivan P. Yamshchikov
37
2
0
06 Sep 2024
Compromising Embodied Agents with Contextual Backdoor Attacks
Aishan Liu
Yuguang Zhou
Xianglong Liu
Tianyuan Zhang
Siyuan Liang
...
Tianlin Li
Junqi Zhang
Wenbo Zhou
Qing-Wu Guo
Dacheng Tao
LLMAG
AAML
29
1
0
06 Aug 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su
Mingyu Lee
SangKeun Lee
30
7
0
02 Aug 2024
Data Mixture Inference: What do BPE Tokenizers Reveal about their Training Data?
J. Hayase
Alisa Liu
Yejin Choi
Sewoong Oh
Noah A. Smith
27
9
0
23 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
47
8
0
20 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
34
77
0
05 Jul 2024
Single Character Perturbations Break LLM Alignment
Leon Lin
Hannah Brown
Kenji Kawaguchi
Michael Shieh
AAML
35
2
0
03 Jul 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
36
26
0
26 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland
Alexander Lyzhov
Jacob Pfau
Salsabila Mahdi
Samuel R. Bowman
LLMSV
AAML
52
17
0
21 Jun 2024
AgentDojo: A Dynamic Environment to Evaluate Attacks and Defenses for LLM Agents
Edoardo Debenedetti
Jie Zhang
Mislav Balunović
Luca Beurer-Kellner
Marc Fischer
Florian Tramèr
LLMAG
AAML
43
25
1
19 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
42
10
0
13 Jun 2024
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng
Yongjian Guo
Changzhou Han
Wanlun Ma
Junwu Xiong
Sheng Wen
Yang Xiang
42
19
0
04 Jun 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Sander Land
Max Bartolo
21
20
0
08 May 2024
Can Large Language Models put 2 and 2 together? Probing for Entailed Arithmetical Relationships
D. Panas
S. Seth
V. Belle
ReLM
LRM
18
2
0
30 Apr 2024
Rethinking LLM Memorization through the Lens of Adversarial Compression
Avi Schwarzschild
Zhili Feng
Pratyush Maini
Zachary Chase Lipton
J. Zico Kolter
39
38
0
23 Apr 2024
Pixels and Predictions: Potential of GPT-4V in Meteorological Imagery Analysis and Forecast Communication
John R. Lawson
Montgomery Flora
Kevin H. Goebbert
Seth N. Lyman
Corey K. Potvin
David M. Schultz
Adam J. Stepanek
Joseph E. Trujillo-Falcón
MLLM
34
1
0
22 Apr 2024
The Instruction Hierarchy: Training LLMs to Prioritize Privileged Instructions
Eric Wallace
Kai Y. Xiao
R. Leike
Lilian Weng
Johannes Heidecke
Alex Beutel
SILM
47
113
0
19 Apr 2024
Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra
Darioush Kevian
U. Syed
Xing-ming Guo
Aaron J. Havens
Geir Dullerud
Peter M. Seiler
Lianhui Qin
Bin Hu
ELM
31
29
0
04 Apr 2024
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs
Aly M. Kassem
Omar Mahmoud
Niloofar Mireshghallah
Hyunwoo J. Kim
Yulia Tsvetkov
Yejin Choi
Sherif Saad
Santu Rana
47
18
0
05 Mar 2024
Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
Adib Hasan
Ileana Rugina
Alex Wang
AAML
47
22
0
19 Jan 2024
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Jiahao Yu
Xingwei Lin
Zheng Yu
Xinyu Xing
SILM
110
292
0
19 Sep 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo
Alexandre Sablayrolles
Hervé Jégou
Douwe Kiela
SILM
98
225
0
15 Apr 2021
1