Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations

10 October 2023
Zeming Wei, Yifei Wang, Ang Li, Yichuan Mo, Yisen Wang

Papers citing "Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations"

Showing 50 of 164 citing papers.
  • Distillability of LLM Security Logic: Predicting Attack Success Rate of Outline Filling Attack via Ranking Regression. Tianyu Zhang, Zihang Xi, Jingyu Hua, Sheng Zhong. 27 Nov 2025.
  • Understanding and Mitigating Over-refusal for Large Language Models via Safety Representation. Junbo Zhang, Ran Chen, Qianli Zhou, Xinyang Deng, Wen Jiang. 24 Nov 2025.
  • TASO: Jailbreak LLMs via Alternative Template and Suffix Optimization. Yanting Wang, Runpeng Geng, Jinghui Chen, Minhao Cheng, Jinyuan Jia. 23 Nov 2025.
  • "To Survive, I Must Defect": Jailbreaking LLMs via the Game-Theory Scenarios. Zhen Sun, Zongmin Zhang, Deqi Liang, Han Sun, Yule Liu, ..., Xiangshan Gao, Yilong Yang, Shuai Liu, Yutao Yue, Xinlei He. 20 Nov 2025.
  • Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment. Peng Zhang, Peijie Sun. 10 Nov 2025.
  • "Give a Positive Review Only": An Early Investigation Into In-Paper Prompt Injection Attacks and Defenses for AI Reviewers. Qin Zhou, Zhexin Zhang, Zhi Li, Limin Sun. 03 Nov 2025.
  • Measuring the Security of Mobile LLM Agents under Adversarial Prompts from Untrusted Third-Party Channels. Chenghao Du, Quanfeng Huang, Tingxuan Tang, Zihao Wang, Adwait Nadkarni, Yue Xiao. 31 Oct 2025.
  • A Survey on Unlearning in Large Language Models. Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun. 29 Oct 2025.
  • Defending Against Prompt Injection with DataFilter. Yizhu Wang, Sizhe Chen, Raghad Alkhudair, Basel Alomair, David Wagner. 22 Oct 2025.
  • Black-box Optimization of LLM Outputs by Asking for Directions. Jie Zhang, Meng Ding, Yang Liu, Jue Hong, F. Tramèr. 19 Oct 2025.
  • SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models. Hanbin Hong, Shuya Feng, Nima Naderloui, Shenao Yan, Jingyu Zhang, Biying Liu, Ali Arastehfard, Heqing Huang, Yuan Hong. 17 Oct 2025.
  • Keep Calm and Avoid Harmful Content: Concept Alignment and Latent Manipulation Towards Safer Answers. Ruben Belo, Cláudia Soares, Marta Guimarães. 14 Oct 2025.
  • ArtPerception: ASCII Art-based Jailbreak on LLMs with Recognition Pre-test. Journal of Network and Computer Applications (JNCA), 2025. Guan-Yan Yang, Tzu-Yu Cheng, Ya-Wen Teng, Farn Wang, Kuo-Hui Yeh. 11 Oct 2025.
  • A geometrical approach to solve the proximity of a point to an axisymmetric quadric in space. Bibekananda Patra, Aditya Mahesh Kolte, Sandipan Bandyopadhyay. 10 Oct 2025.
  • The Attacker Moves Second: Stronger Adaptive Attacks Bypass Defenses Against LLM Jailbreaks and Prompt Injections. Milad Nasr, Nicholas Carlini, Chawin Sitawarin, Sander Schulhoff, Jamie Hayes, ..., Ilia Shumailov, Abhradeep Thakurta, Kai Yuanqing Xiao, Seth Neel, F. Tramèr. 10 Oct 2025.
  • Pattern Enhanced Multi-Turn Jailbreaking: Exploiting Structural Vulnerabilities in Large Language Models. Ragib Amin Nihal, Rui Wen, Kazuhiro Nakadai, Jun Sakuma. 09 Oct 2025.
  • Imperceptible Jailbreaking against Large Language Models. Kuofeng Gao, Y. Li, Chao Du, X. Wang, Xingjun Ma, Shu-Tao Xia, Tianyu Pang. 06 Oct 2025.
  • Bypassing Prompt Guards in Production with Controlled-Release Prompting. Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang. 02 Oct 2025.
  • Better Privilege Separation for Agents by Restricting Data Types. Dennis Jacob, Emad Alghamdi, Zhanhao Hu, Basel Alomair, David Wagner. 30 Sep 2025.
  • Think Twice, Generate Once: Safeguarding by Progressive Self-Reflection. Hoang Phan, Victor Li, Qi Lei. 29 Sep 2025.
  • RADAR: A Risk-Aware Dynamic Multi-Agent Framework for LLM Safety Evaluation via Role-Specialized Collaboration. X. Chen, Jian Zhao, Yuchen Yuan, T. Zhang, Huilin Zhou, ..., Ping Hu, Linghe Kong, Chi Zhang, Weiran Huang, Xuelong Li. 28 Sep 2025.
  • Jailbreaking on Text-to-Video Models via Scene Splitting Strategy. Wonjun Lee, Haon Park, Doehyeon Lee, Bumsub Ham, Suhyun Kim. 26 Sep 2025.
  • You Can't Steal Nothing: Mitigating Prompt Leakages in LLMs via System Vectors. Bochuan Cao, Changjiang Li, Yuanpu Cao, Yameng Ge, Ting Wang, Jinghui Chen. 26 Sep 2025.
  • Bidirectional Intention Inference Enhances LLMs' Defense Against Multi-Turn Jailbreak Attacks. Haibo Tong, Dongcheng Zhao, Guobin Shen, Xiang He, Dachuan Lin, Feifei Zhao, Yi Zeng. 25 Sep 2025.
  • SABER: Uncovering Vulnerabilities in Safety Alignment via Cross-Layer Residual Connection. Maithili Joshi, Palash Nandi, Tanmoy Chakraborty. 19 Sep 2025.
  • K2-Think: A Parameter-Efficient Reasoning System. Zhoujun Cheng, Richard Fan, Shibo Hao, Taylor W. Killian, Haonan Li, ..., Xuezhe Ma, Guowei He, Zhiting Hu, Zhengzhong Liu, Eric P. Xing. 09 Sep 2025.
  • False Sense of Security: Why Probing-based Malicious Input Detection Fails to Generalize. Cheng Wang, Zeming Wei, Qin Liu, Muhao Chen. 04 Sep 2025.
  • A Comprehensive Survey on Trustworthiness in Reasoning with Large Language Models. Yanbo Wang, Yongcan Yu, Jian Liang, Ran He. 04 Sep 2025.
  • Baichuan-M2: Scaling Medical Capability with Large Verifier System. Baichuan-M2 Team, Chengfeng Dou, Chong Liu, Chenzheng Zhu, Fei Li, ..., Zheng Liang, Zhishou Zhang, Hengfu Cui, Zuyi Zhu, X. Wang. 02 Sep 2025.
  • Unraveling LLM Jailbreaks Through Safety Knowledge Neurons. Chongwen Zhao, Kaizhu Huang. 01 Sep 2025.
  • On Surjectivity of Neural Networks: Can you elicit any behavior from your model? Haozhe Jiang, Nika Haghtalab. 26 Aug 2025.
  • Speculative Safety-Aware Decoding. Xuekang Wang, Shengyu Zhu, Xueqi Cheng. 25 Aug 2025.
  • SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks. Xiangman Li, Xiaodong Wu, Qi Li, Jianbing Ni, Rongxing Lu. 21 Aug 2025.
  • SDGO: Self-Discrimination-Guided Optimization for Consistent Safety in Large Language Models. Peng Ding, Wen Sun, Dailin Li, Wei Zou, Jiaming Wang, Jiajun Chen, Shujian Huang. 21 Aug 2025.
  • CCFC: Core & Core-Full-Core Dual-Track Defense for LLM Jailbreak Protection. Jiaming Hu, Haoyu Wang, Debarghya Mukherjee, Ioannis Ch. Paschalidis. 19 Aug 2025.
  • Mitigating Jailbreaks with Intent-Aware LLMs. Wei Jie Yeo, Frank Xing, Erik Cambria. 16 Aug 2025.
  • Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation. Huizhen Shu, Xuying Li, Qirui Wang, Yuji Kosuga, Mengqiu Tian, Zhuo Li. 14 Aug 2025.
  • A Survey on Training-free Alignment of Large Language Models. Birong Pan, Yongqi Li, Jiasheng Si, Sibo Wei, Mayi Xu, Shen Zhou, Yuanyuan Zhu, Ming Zhong, T. Qian. 12 Aug 2025.
  • A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection. Ivan Zhang. 10 Aug 2025.
  • The Cost of Thinking: Increased Jailbreak Risk in Large Language Models. Fan Yang. 09 Aug 2025.
  • Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs. Jinhwa Kim, Ian G. Harris. 09 Aug 2025.
  • Fine-Grained Safety Neurons with Training-Free Continual Projection to Reduce LLM Fine Tuning Risks. Bing Han, Feifei Zhao, Dongcheng Zhao, Guobin Shen, Ping Wu, Yu Shi, Yi Zeng. 08 Aug 2025.
  • Automatic LLM Red Teaming. Roman Belaire, Arunesh Sinha, Pradeep Varakantham. 06 Aug 2025.
  • Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs. Sanhanat Sivapiromrat, Caiqi Zhang, Marco Basaldella, Nigel Collier. 15 Jul 2025.
  • Defending Against Prompt Injection With a Few DefensiveTokens. Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner. 10 Jul 2025.
  • Response Attack: Exploiting Contextual Priming to Jailbreak Large Language Models. Ziqi Miao, Lijun Li, Yuan Xiong, Zhenhua Liu, Pengyu Zhu, Jing Shao. 07 Jul 2025.
  • Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks. Sizhe Chen, Arman Zharmagambetov, David Wagner, Chuan Guo. 03 Jul 2025.
  • Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection. Ziqi Miao, Yi Ding, Lijun Li, Jing Shao. 03 Jul 2025.
  • Probe before You Talk: Towards Black-box Defense against Backdoor Unalignment for Large Language Models. International Conference on Learning Representations (ICLR), 2025. Biao Yi, Tiansheng Huang, Sishuo Chen, Tong Li, Zheli Liu, Zhixuan Chu, Yiming Li. 19 Jun 2025.
  • From LLMs to MLLMs to Agents: A Survey of Emerging Paradigms in Jailbreak Attacks and Defenses within LLM Ecosystem. Yanxu Mao, Tiehan Cui, Peipei Liu, Datao You, Hongsong Zhu. 18 Jun 2025.