Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses

3 June 2024
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min Lin
    AAML
ArXiv (abs) · PDF · HTML · HuggingFace (1 upvote) · GitHub (61★)

Papers citing "Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses"

44 / 44 papers shown
Adversarial Attack-Defense Co-Evolution for LLM Safety Alignment via Tree-Group Dual-Aware Search and Optimization
Xurui Li
Kaisong Song
Rui Zhu
Pin-Yu Chen
Haixu Tang
AAML
373
0
0
24 Nov 2025
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong
Shuya Feng
Nima Naderloui
Shenao Yan
Jingyu Zhang
Biying Liu
Ali Arastehfard
Heqing Huang
Yuan Hong
AAML
225
0
0
17 Oct 2025
A geometrical approach to solve the proximity of a point to an axisymmetric quadric in space
Bibekananda Patra
Aditya Mahesh Kolte
Sandipan Bandyopadhyay
107
11
0
10 Oct 2025
Imperceptible Jailbreaking against Large Language Models
Kuofeng Gao
Y. Li
Chao Du
X. Wang
Xingjun Ma
Shu-Tao Xia
Tianyu Pang
AAML
110
0
0
06 Oct 2025
NEXUS: Network Exploration for eXploiting Unsafe Sequences in Multi-Turn LLM Jailbreaks
J. Asl
Sidhant Narula
Mohammad Ghasemigol
Eduardo Blanco
Daniel Takabi
AAML
155
0
0
03 Oct 2025
SafeBehavior: Simulating Human-Like Multistage Reasoning to Mitigate Jailbreak Attacks in Large Language Models
Qinjian Zhao
Jiaqi Wang
Zhiqiang Gao
Zhihao Dou
Belal Abuhaija
Kaizhu Huang
AAML LRM
100
0
0
30 Sep 2025
SafeLLM: Unlearning Harmful Outputs from Large Language Models against Jailbreak Attacks
Xiangman Li
Xiaodong Wu
Qi Li
Jianbing Ni
Rongxing Lu
AAML MU KELM
84
0
0
21 Aug 2025
Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation
Dongyoon Hahm
Taywon Min
Woogyeol Jin
Kimin Lee
SILM
164
3
0
19 Aug 2025
Mitigating Jailbreaks with Intent-Aware LLMs
Wei Jie Yeo
Frank Xing
Erik Cambria
AAML
121
0
0
16 Aug 2025
Layer-Wise Perturbations via Sparse Autoencoders for Adversarial Text Generation
Huizhen Shu
Xuying Li
Qirui Wang
Yuji Kosuga
Mengqiu Tian
Zhuo Li
AAML SILM
137
0
0
14 Aug 2025
Many-Turn Jailbreaking
Xianjun Yang
Liqiang Xiao
Shiyang Li
Faisal Ladhak
Hyokun Yun
Linda R. Petzold
Yi Xu
William Wang
111
0
0
09 Aug 2025
MetaCipher: A Time-Persistent and Universal Multi-Agent Framework for Cipher-Based Jailbreak Attacks for LLMs
Boyuan Chen
Minghao Shao
Abdul Basit
S. Garg
Muhammad Shafique
AAML
182
0
0
27 Jun 2025
Lifelong Safety Alignment for Language Models
Haoyu Wang
Zeyu Qin
Yifei Zhao
C. Du
Min Lin
Xueqian Wang
Tianyu Pang
KELM CLL
252
5
0
26 May 2025
What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Sangyeop Kim
Yohan Lee
Yongwoo Song
Kimin Lee
AAML
175
0
0
26 May 2025
Chain-of-Lure: A Universal Jailbreak Attack Framework using Unconstrained Synthetic Narratives
Wenhan Chang
Tianqing Zhu
Yu Zhao
Shuangyong Song
Ping Xiong
Wanlei Zhou
249
2
0
23 May 2025
MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Csaba Dékány
Stefan Balauca
Robin Staab
Dimitar I. Dimitrov
Martin Vechev
AAML
251
1
0
22 May 2025
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs
"Haet Bhasha aur Diskrimineshun": Phonetic Perturbations in Code-Mixed Hinglish to Red-Team LLMs
Darpan Aswal
Siddharth D Jaiswal
AAML
171
0
0
20 May 2025
Multilingual Collaborative Defense for Large Language Models
Hongliang Li
Jinan Xu
Gengping Cui
Changhao Guan
Fengran Mo
Kaiyu Huang
AAML
330
0
0
17 May 2025
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yidan Wang
Yanan Cao
Yubing Ren
Fang Fang
Zheng Lin
Binxing Fang
PILM
432
6
0
15 May 2025
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Bang An
Shiyue Zhang
Mark Dredze
373
19
0
25 Apr 2025
Jailbreak Detection in Clinical Training LLMs Using Feature-Based Predictive Models
Tri Nguyen
Lohith Srikanth Pentapalli
Magnus Sieverding
Laurah Turner
Seth Overla
...
Michael Gharib
Matt Kelleher
Michael Shukis
Cameron Pawlik
Kelly Cohen
238
0
0
21 Apr 2025
Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses
Zhengchun Shang
Wenlan Wei
Weiheng Bai
AAML
339
7
0
02 Apr 2025
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Parishad BehnamGhader
Nicholas Meade
Siva Reddy
245
3
0
11 Mar 2025
MAD-MAX: Modular And Diverse Malicious Attack MiXtures for Automated LLM Red Teaming
Stefan Schoepf
Muhammad Zaid Hameed
Ambrish Rawat
Kieran Fraser
Giulio Zizzo
Giandomenico Cornacchia
Mark Purcell
188
0
0
08 Mar 2025
GuidedBench: Measuring and Mitigating the Evaluation Discrepancies of In-the-wild LLM Jailbreak Methods
Ruixuan Huang
Xunguang Wang
Zongjie Li
Daoyuan Wu
Shuai Wang
ALM ELM
364
0
0
24 Feb 2025
SQL Injection Jailbreak: A Structural Disaster of Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Jiawei Zhao
Kejiang Chen
Weinan Zhang
Nenghai Yu
AAML
493
5
0
03 Nov 2024
Plentiful Jailbreaks with String Compositions
Brian R. Y. Huang
AAML
338
3
0
01 Nov 2024
Desert Camels and Oil Sheikhs: Arab-Centric Red Teaming of Frontier LLMs
Muhammed Saeed
Elgizouli Mohamed
Mukhtar Mohamed
Shaina Raza
Muhammad Abdul-Mageed
Shady Shehata
223
0
0
31 Oct 2024
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu
Yue Feng
Zhao Xu
Lixin Su
Xinyu Ma
D. Yin
Hao Liu
ELM
297
33
0
11 Oct 2024
Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
International Conference on Learning Representations (ICLR), 2024
Tong Wu
Shujian Zhang
Kaiqiang Song
Silei Xu
Sanqiang Zhao
Ravi Agrawal
Sathish Indurthi
Chong Xiang
Prateek Mittal
Wenxuan Zhou
362
29
0
09 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
International Conference on Learning Representations (ICLR), 2024
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min Lin
237
22
0
09 Oct 2024
FlipAttack: Jailbreak LLMs via Flipping
Yue Liu
Xiaoxin He
Miao Xiong
Jinlan Fu
Shumin Deng
Bryan Hooi
AAML
213
38
0
02 Oct 2024
AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua
Yun Yvonna Li
Shiyi Yang
Chen Wang
Lina Yao
LM&MA
333
35
0
06 Jul 2024
Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi
Yule Liu
Zhen Sun
Tianshuo Cong
Xinlei He
Jiaxing Song
Ke Xu
Qi Li
AAML
291
189
0
05 Jul 2024
Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu
Fan Liu
Hao Liu
AAML
230
25
0
13 Jun 2024
SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang
Daoyuan Wu
Zhenlan Ji
Zongjie Li
Pingchuan Ma
Shuai Wang
Yingjiu Li
Yang Liu
Ning Liu
Juergen Rahmel
AAML
448
31
0
08 Jun 2024
Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs
Fan Liu
Zhao Xu
Hao Liu
AAML
234
24
0
07 Jun 2024
Defensive Prompt Patch: A Robust and Interpretable Defense of LLMs against Jailbreak Attacks
Chen Xiong
Xiangyu Qi
Pin-Yu Chen
Tsung-Yi Ho
AAML
342
32
0
30 May 2024
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
International Conference on Learning Representations (ICLR), 2024
Maksym Andriushchenko
Francesco Croce
Nicolas Flammarion
AAML
728
354
0
02 Apr 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
Patrick Chao
Edoardo Debenedetti
Avi Schwarzschild
Maksym Andriushchenko
Francesco Croce
...
Nicolas Flammarion
George J. Pappas
F. Tramèr
Hamed Hassani
Eric Wong
ALM ELM AAML
379
269
0
28 Mar 2024
Defending Jailbreak Prompts via In-Context Adversarial Game
Yujun Zhou
Yufei Han
Haomin Zhuang
Kehan Guo
Zhenwen Liang
Hongyan Bao
Xiangliang Zhang
LLMAG AAML
416
26
0
20 Feb 2024
Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks
Yixin Cheng
Markos Georgopoulos
Volkan Cevher
Grigorios G. Chrysos
AAML
153
23
0
14 Feb 2024
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
Avi Schwarzschild
Eric Wong
Hamed Hassani
George J. Pappas
AAML
563
372
0
05 Oct 2023
Certifying LLM Safety against Adversarial Prompting
Aounon Kumar
Chirag Agarwal
Suraj Srinivas
Aaron Jiaxun Li
Soheil Feizi
Himabindu Lakkaraju
AAML
650
261
0
06 Sep 2023