arXiv:2310.06387
Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations
10 October 2023
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang

Papers citing "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations" (50 of 164 papers shown)

Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara. 11 Jun 2025.

VerIF: Verification Engineering for Reinforcement Learning in Instruction Following
Hao Peng, Yunjia Qi, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li. [OffRL] 11 Jun 2025.

TwinBreak: Jailbreaking LLM Security Alignments based on Twin Prompts
T. Krauß, Hamid Dashtbani, Alexandra Dmitrienko. 09 Jun 2025.

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Haoyang Li, Huan Gao, Zhiyuan Zhao, Zhiyu Lin, Junyu Gao, Xuelong Li. [AAML] 09 Jun 2025.

Enhancing the Safety of Medical Vision-Language Models by Synthetic Demonstrations
Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani. [AAML] 08 Jun 2025.

A Trustworthiness-based Metaphysics of Artificial Intelligence Systems
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Andrea Ferrario. 03 Jun 2025.

Bootstrapping LLM Robustness for VLM Safety via Reducing the Pretraining Modality Gap
Wenhan Yang, Spencer Stice, Ali Payani, Baharan Mirzasoleiman. [MLLM] 30 May 2025.

Learning Safety Constraints for Large Language Models
Xin Chen, Yarden As, Andreas Krause. 30 May 2025.

OMNIGUARD: An Efficient Approach for AI Safety Moderation Across Languages and Modalities
Sahil Verma, Keegan E. Hines, J. Bilmes, Charlotte Siska, Luke Zettlemoyer, Hila Gonen, Chandan Singh. [AAML] 29 May 2025.

What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Sangyeop Kim, Yohan Lee, Yongwoo Song, Kimin Lee. [AAML] 26 May 2025.

Benign-to-Toxic Jailbreaking: Inducing Harmful Responses from Harmless Prompts
H. Kim, Minbeom Kim, Wonjun Lee, Kihyun Kim, Changick Kim. 26 May 2025.

An Embarrassingly Simple Defense Against LLM Abliteration Attacks
Harethah Shairah, Hasan Hammoud, Bernard Ghanem, G. Turkiyyah. 25 May 2025.

Exploring the Vulnerability of the Content Moderation Guardrail in Large Language Models via Intent Manipulation
Jun Zhuang, Haibo Jin, Ye Zhang, Zhengjian Kang, Wenbin Zhang, Gaby G. Dagher, Haohan Wang. [AAML] 24 May 2025.

Unveiling the Basin-Like Loss Landscape in Large Language Models
Huanran Chen, Yinpeng Dong, Zeming Wei, Yao Huang, Yichi Zhang, Hang Su, Jun Zhu. [MoMe] 23 May 2025.

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models
Zifan Peng, Yule Liu, Zhen Sun, Mingchen Li, Zeren Luo, ..., Xinlei He, Xuechao Wang, Yingjie Xue, Shengmin Xu, Xinyi Huang. [AuLLM, AAML] 23 May 2025.

Revisiting Backdoor Attacks on LLMs: A Stealthy and Practical Poisoning Framework via Harmless Inputs
Jiawei Kong, Hao Fang, Xiaochen Yang, Kuofeng Gao, Bin Chen, Shu-Tao Xia, Yaowei Wang, Min Zhang. [AAML] 23 May 2025.

Advancing LLM Safe Alignment with Safety Representation Ranking
Tianqi Du, Zeming Wei, Quan Chen, Chenheng Zhang, Yisen Wang. [ALM] 21 May 2025.

Chain-of-Thought Driven Adversarial Scenario Extrapolation for Robust Language Models
Md Rafi Ur Rashid, Vishnu Asutosh Dasu, Ye Wang, Gang Tan, Shagufta Mehnaz. [AAML, ELM] 20 May 2025.

SPIRIT: Patching Speech Language Models against Jailbreak Attacks
Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, Nils Lukas. [AAML] 18 May 2025.

PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yidan Wang, Yanan Cao, Yubing Ren, Fang Fang, Zheng Lin, Binxing Fang. [PILM] 15 May 2025.

One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models
Haoran Gu, Handing Wang, Yi Mei, Mengjie Zhang, Yaochu Jin. 12 May 2025.

Cannot See the Forest for the Trees: Invoking Heuristics and Biases to Elicit Irrational Choices of LLMs
Haoming Yang, Ke Ma, Xiaojun Jia, Yingfei Sun, Qianqian Xu, Qingming Huang. [AAML] 03 May 2025.

Transferable Adversarial Attacks on Black-Box Vision-Language Models
Kai Hu, Weichen Yu, Guang Dai, Alexander Robey, Andy Zou, Chengming Xu, Haoqi Hu, Matt Fredrikson. [AAML, VLM] 02 May 2025.

Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control
Hannah Cyberey, David Evans. [LLMSV] 23 Apr 2025.

WASP: Benchmarking Web Agent Security Against Prompt Injection Attacks
Ivan Evtimov, Arman Zharmagambetov, Aaron Grattafiori, Chuan Guo, Kamalika Chaudhuri. [ELM] 22 Apr 2025.

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender
Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, ..., Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu. [LLMSV, AAML] 13 Apr 2025.

LightDefense: A Lightweight Uncertainty-Driven Defense against Jailbreaks via Shifted Token Distribution
Zhuoran Yang, Jie Peng. [AAML] 02 Apr 2025.

Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents
Juhee Kim, Woohyuk Choi, Byoungyoung Lee. [LLMAG] 17 Mar 2025.

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning
Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu. [AAML, MU] 14 Mar 2025.

Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Yongqian Li, Chengkun Wei, Wenzhi Chen. [AAML] 11 Mar 2025.

Adversarial Training for Multimodal Large Language Models against Jailbreak Attacks
Liming Lu, Shuchao Pang, Yaning Tan, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou. [AAML] 05 Mar 2025.

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xinsong Zhang. [AAML] 27 Feb 2025.

Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shiyu Xiang, Ansen Zhang, Yanfei Cao, Yang Fan, Ronghao Chen. [AAML] 26 Feb 2025.

GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Ruixuan Huang, Xunguang Wang, Zongjie Li, Daoyuan Wu, Shuai Wang. [ALM, ELM] 24 Feb 2025.

On the Robustness of Transformers against Context Hijacking for Linear Classification
Tianle Li, Chenyang Zhang, Xingwu Chen, Yuan Cao, Difan Zou. 24 Feb 2025.

SafeInt: Shielding Large Language Models from Jailbreak Attacks via Safety-Aware Representation Intervention
Jiaqi Wu, Chen Chen, Chunyan Hou, Xiaojie Yuan. [AAML] 21 Feb 2025.

A Mousetrap: Fooling Large Reasoning Models for Jailbreak with Chain of Iterative Chaos
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yang Yao, Xuan Tong, Ruofan Wang, Yixu Wang, Lujundong Li, Liang Liu, Yan Teng, Yun Wang. [LRM] 19 Feb 2025.

SafeEraser: Enhancing Safety in Multimodal Large Language Models through Multimodal Machine Unlearning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junkai Chen, Zhijie Deng, Kening Zheng, Yibo Yan, Qi Zheng, PeiJun Wu, Peijie Jiang, Qingbin Liu, Xuming Hu. [MU] 18 Feb 2025.

StructTransform: A Scalable Attack Surface for Safety-Aligned Large Language Models
Shehel Yoosuf, Temoor Ali, Ahmed Lekssays, Mashael Alsabah, Issa M. Khalil. 17 Feb 2025.

JBShield: Defending Large Language Models from Jailbreak Attacks through Activated Concept Analysis and Manipulation
Shenyi Zhang, Yuchen Zhai, Keyan Guo, Hongxin Hu, Shengnan Guo, Zheng Fang, Lingchen Zhao, Chao Shen, Cong Wang, Qian Wang. [AAML] 11 Feb 2025.

When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Neural Information Processing Systems (NeurIPS), 2024
Xuan Chen, Yuzhou Nie, Wenbo Guo, Xiangyu Zhang. 28 Jan 2025.

Refining Input Guardrails: Enhancing LLM-as-a-Judge Efficiency Through Chain-of-Thought Fine-Tuning and Alignment
Melissa Kazemi Rad, Huy Nghiem, Andy Luo, Sahil Wadhwa, Mohammad Sorower, Stephen Rawls. [AAML] 22 Jan 2025.

Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant. 20 Jan 2025.

Keeping LLMs Aligned After Fine-tuning: The Crucial Role of Prompt Templates
Neural Information Processing Systems (NeurIPS), 2024
Kaifeng Lyu, Haoyu Zhao, Xinran Gu, Dingli Yu, Anirudh Goyal, Sanjeev Arora. [ALM] 20 Jan 2025.

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation
International Conference on Learning Representations (ICLR), 2025
Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang. 03 Jan 2025.

LLM-Virus: Evolutionary Jailbreak Attack on Large Language Models
Miao Yu, Cunchun Li, Yingjie Zhou, Xing Fan, Kun Wang, Shirui Pan, Qingsong Wen. [AAML] 03 Jan 2025.

Exploring Response Uncertainty in MLLMs: An Empirical Evaluation under Misleading Scenarios
Yunkai Dang, Mengxi Gao, Yibo Yan, Xin Zou, Yanggan Gu, ..., Jingyu Wang, Peijie Jiang, Aiwei Liu, Jia Liu, Xuming Hu. 05 Nov 2024.

SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jiawei Zhao, Kejiang Chen, Weinan Zhang, Nenghai Yu. [AAML] 03 Nov 2024.

Defense Against Prompt Injection Attack by Leveraging Attack Techniques
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yulin Chen, Haoran Li, Zihao Zheng, Yangqiu Song, Dekai Wu, Bryan Hooi. [AAML, SILM] 01 Nov 2024.

What is Wrong with Perplexity for Long-context Language Modeling?
International Conference on Learning Representations (ICLR), 2024
Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, Yisen Wang. 31 Oct 2024.