Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

10 October 2023
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang

Papers citing "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations"

Showing 50 of 165 citing papers.

AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts
Vishal Kumar, Zeyi Liao, Jaylen Jones, Huan Sun
29 Oct 2024

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring (NAACL, 2024)
Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, ..., Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
28 Oct 2024

LLMScan: Causal Scan for LLM Misbehavior Detection
Mengdi Zhang, Kai Kiat Goh, Peixin Zhang, Jun Sun, Rose Lin Xin, Hongyu Zhang
22 Oct 2024

Bayesian scaling laws for in-context learning
Aryaman Arora, Dan Jurafsky, Christopher Potts, Noah D. Goodman
21 Oct 2024

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo
15 Oct 2024

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy (ICLR, 2024)
Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
09 Oct 2024

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates (ICLR, 2024)
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
09 Oct 2024

Non-Halting Queries: Exploiting Fixed Points in LLMs
Ghaith Hammouri, Kemal Derya, B. Sunar
08 Oct 2024

AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (ICLR, 2024)
Xiaogeng Liu, Peiran Li, Edward Suh, Yevgeniy Vorobeychik, Zhuoqing Mao, Somesh Jha, Patrick McDaniel, Huan Sun, Bo Li, Chaowei Xiao
03 Oct 2024

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models (ICLR, 2024)
Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang He, Yi Zeng
03 Oct 2024

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models (ICLR, 2024)
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
02 Oct 2024

Endless Jailbreaks with Bijection Learning (ICLR, 2024)
Brian R. Y. Huang, Maximilian Li, Leonard Tang
02 Oct 2024

CoCA: Regaining Safety-awareness of Multimodal Large Language Models with Constitutional Calibration
Lei Li, Renjie Pi, Tianyang Han, Han Wu, Lanqing Hong, Lingpeng Kong, Xin Jiang, Zhenguo Li
17 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
05 Sep 2024

Defending against Jailbreak through Early Exit Generation of Large Language Models
Chongwen Zhao, Zhihao Dou, Kaizhu Huang
21 Aug 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan
20 Jul 2024

Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
16 Jul 2024

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Shu Yang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
12 Jul 2024

Large Language Models Are Involuntary Truth-Tellers: Exploiting Fallacy Failure for Jailbreak Attacks
Yue Zhou, Henry Peng Zou, Barbara Di Eugenio, Yang Zhang
01 Jul 2024

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
26 Jun 2024

ReCaLL: Membership Inference via Relative Conditional Log-Likelihoods
Roy Xie, Junlin Wang, Ruomin Huang, Minxing Zhang, Rong Ge, Jian Pei, Neil Zhenqiang Gong, Bhuwan Dhingra
23 Jun 2024

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
Siyuan Wang, Zhuohan Long, Zhihao Fan, Zhongyu Wei
21 Jun 2024

SPA-VL: A Comprehensive Safety Preference Alignment Dataset for Vision Language Model
Yongting Zhang, Lu Chen, Guodong Zheng, Yifeng Gao, Rui Zheng, ..., Yu Qiao, Xuanjing Huang, Feng Zhao, Tao Gui, Jing Shao
17 Jun 2024

JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang
13 Jun 2024

How Alignment and Jailbreak Work: Explain LLM Safety through Intermediate Hidden States (EMNLP, 2024)
Zhenhong Zhou, Haiyang Yu, Xinghua Zhang, Rongwu Xu, Fei Huang, Yongbin Li
09 Jun 2024

Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
Avital Shafran, R. Schuster, Vitaly Shmatikov
09 Jun 2024

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel
08 Jun 2024

AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng, Yongjian Guo, Changzhou Han, Wanlun Ma, Junwu Xiong, Sheng Wen, Yang Xiang
04 Jun 2024

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
03 Jun 2024

Mind the Inconspicuous: Revealing the Hidden Weakness in Aligned LLMs' Refusal Boundaries
Jiahao Yu, Haozheng Luo, Jerry Yao-Chieh Hu, Wenbo Guo, Han Liu, Xinyu Xing
31 May 2024

OR-Bench: An Over-Refusal Benchmark for Large Language Models
Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
31 May 2024

Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin, Andy Zhou, Joe D. Menke, Haohan Wang
30 May 2024

Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong, Xiangyu Qi, Pin-Yu Chen, Tsung-Yi Ho
30 May 2024

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation (NeurIPS, 2024)
Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua
23 May 2024

GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation
Govind Ramesh, Yao Dou, Wei Xu
21 May 2024

AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs
Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
21 Apr 2024

Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks (ICLR, 2024)
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
02 Apr 2024

RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Basel Alomair, Yue Liu
19 Mar 2024

Large language models in 6G security: challenges and opportunities
Tri Nguyen, Huong Nguyen, Ahmad Ijaz, Saeid Sheikhi, Athanasios V. Vasilakos, Panos Kostakos
18 Mar 2024

EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models
Weikang Zhou, Xiao Wang, Limao Xiong, Han Xia, Yingshuang Gu, ..., Lijun Li, Jing Shao, Tao Gui, Xuanjing Huang
18 Mar 2024

Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models (ECCV, 2024)
Yifan Li, Hangyu Guo, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen
14 Mar 2024

Guardrail Baselines for Unlearning in LLMs
Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith
05 Mar 2024

AutoAttacker: A Large Language Model Guided System to Implement Automatic Cyber-attacks
Jiacen Xu, Jack W. Stokes, Geoff McDonald, Xuesong Bai, David Marshall, Siyue Wang, Adith Swaminathan, Zhou Li
02 Mar 2024

Speak Out of Turn: Safety Vulnerability of Large Language Models in Multi-turn Dialogue
Zhenhong Zhou, Jiuyang Xiang, Haopeng Chen, Quan Liu, Zherui Li, Sen Su
27 Feb 2024

On the Duality Between Sharpness-Aware Minimization and Adversarial Training
Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, Zeming Wei
23 Feb 2024

Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou, Yufei Han, Haomin Zhuang, Kehan Guo, Zhenwen Liang, Hongyan Bao, Xiangliang Zhang
20 Feb 2024

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
Fengqing Jiang, Zhangchen Xu, Luyao Niu, Zhen Xiang, Bhaskar Ramasubramanian, Bo Li, Radha Poovendran
19 Feb 2024

Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
Yixin Cheng, Markos Georgopoulos, Volkan Cevher, Grigorios G. Chrysos
14 Feb 2024

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin
13 Feb 2024

In-Context Learning Can Re-learn Forbidden Tasks
Sophie Xhonneux, David Dobre, Jian Tang, Gauthier Gidel, Dhanya Sridhar
08 Feb 2024