arXiv: 2402.06255 (Cited By)
Fight Back Against Jailbreaking via Prompt Adversarial Tuning (v2, latest)

9 February 2024
Yichuan Mo
Yuji Wang
Zeming Wei
Yisen Wang
Topics: AAML, SILM
Links: ArXiv (abs) · PDF · HTML · GitHub (9★)

Papers citing "Fight Back Against Jailbreaking via Prompt Adversarial Tuning"

31 / 31 papers shown
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs
Yunhao Chen, Xin Wang, Juncheng Li, Yixu Wang, Jie Li, Yan Teng, Yingchun Wang, Xingjun Ma
Topics: AAML | 16 Nov 2025

KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs
Shuyuan Liu, Jiawei Chen, Xiao Yang, Hang Su, Z. Yin
Topics: AAML | 09 Nov 2025

A geometrical approach to solve the proximity of a point to an axisymmetric quadric in space
Bibekananda Patra, Aditya Mahesh Kolte, Sandipan Bandyopadhyay
10 Oct 2025

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance
Yuchu Jiang, Jian Zhao, Yuchen Yuan, Tianle Zhang, Yao Huang, ..., Ya Zhang, Shuicheng Yan, Chi Zhang, Z. He, Xuelong Li
Topics: SILM | 12 Aug 2025

Defending Against Prompt Injection With a Few DefensiveTokens
Sizhe Chen, Yizhu Wang, Nicholas Carlini, Chawin Sitawarin, David Wagner
Topics: LLMAG, AAML, SILM | 10 Jul 2025

MIST: Jailbreaking Black-box Large Language Models via Iterative Semantic Tuning
Muyang Zheng, Yuanzhi Yao, C. D. Lin, Rui Wang, Meng Han, Zhiquan Liu
Topics: AAML, VLM | 20 Jun 2025

DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt
Yitong Zhang, Jia Li, L. Cai, Ge Li
Topics: VLM | 11 Jun 2025

Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda, Chunpeng Ma, Masayuki Asahara
11 Jun 2025

One Model Transfer to All: On Robust Jailbreak Prompts Generation against LLMs. International Conference on Learning Representations (ICLR), 2025
Linbao Li, Y. Liu, Daojing He, Yu Li
Topics: AAML | 23 May 2025

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs
Csaba Dékány, Stefan Balauca, Robin Staab, Dimitar I. Dimitrov, Martin Vechev
Topics: AAML | 22 May 2025

DETAM: Defending LLMs Against Jailbreak Attacks via Targeted Attention Modification. Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yu Li, Han Jiang, Zhihua Wei
Topics: AAML | 18 Apr 2025

STShield: Single-Token Sentinel for Real-Time Jailbreak Detection in Large Language Models
Xunguang Wang, Wenxuan Wang, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Daoyuan Wu, Shuai Wang
23 Mar 2025

E$^2$AT: Multimodal Jailbreak Defense via Dynamic Joint Optimization for Multimodal Large Language Models
Liming Lu, Shuchao Pang, Yaning Tan, Haotian Zhu, Xiyu Zeng, Aishan Liu, Yunhuai Liu, Yongbin Zhou
Topics: AAML | 05 Mar 2025

Foot-In-The-Door: A Multi-turn Jailbreak for LLMs
Zixuan Weng, Xiaolong Jin, Jinyuan Jia, Xinsong Zhang
Topics: AAML | 27 Feb 2025

You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense. The Web Conference (WWW), 2025
Wuyuao Mai, Geng Hong, Pei Chen, Xudong Pan, Baojun Liu, Y. Zhang, Haixin Duan, Min Yang
Topics: AAML | 21 Jan 2025

SaLoRA: Safety-Alignment Preserved Low-Rank Adaptation. International Conference on Learning Representations (ICLR), 2025
Mingjie Li, Wai Man Si, Michael Backes, Yang Zhang, Yisen Wang
03 Jan 2025

New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook
Meng Yang, Tianqing Zhu, Chi Liu, Wanlei Zhou, Shui Yu, Philip S. Yu
Topics: AAML, ELM, PILM | 12 Nov 2024

On the Adversarial Transferability of Generalized "Skip Connections"
Yisen Wang, Yichuan Mo, Dongxian Wu, Mingjie Li, Jiabo He, Zhouchen Lin
Topics: SILM, AAML | 11 Oct 2024

Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
Tarun Raheja, Nilay Pochhi
Topics: AAML | 09 Oct 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs. Neural Information Processing Systems (NeurIPS), 2024
Jingtong Su, Mingyu Lee, SangKeun Lee
02 Aug 2024

Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang
25 Jul 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan
20 Jul 2024

JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang
Topics: PILM | 26 Jun 2024

SelfDefend: LLMs Can Defend Themselves against Jailbreaking in a Practical Manner
Xunguang Wang, Daoyuan Wu, Zhenlan Ji, Zongjie Li, Pingchuan Ma, Shuai Wang, Yingjiu Li, Yang Liu, Ning Liu, Juergen Rahmel
Topics: AAML | 08 Jun 2024

Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min Lin
Topics: AAML | 03 Jun 2024

A Theoretical Understanding of Self-Correction through In-context Alignment
Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, Yisen Wang
Topics: LRM | 28 May 2024

LLMs Can Defend Themselves Against Jailbreaking in a Practical Manner: A Vision Paper
Daoyuan Wu, Shuaibao Wang, Yang Liu, Ning Liu
Topics: AAML | 24 Feb 2024

On the Duality Between Sharpness-Aware Minimization and Adversarial Training
Yihao Zhang, Hangzhou He, Jingyu Zhu, Huanran Chen, Yifei Wang, Zeming Wei
Topics: AAML | 23 Feb 2024

Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast
Xiangming Gu, Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Ye Wang, Jing Jiang, Min Lin
Topics: LLMAG, LM&Ro | 13 Feb 2024

Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou, Bo Li, Haohan Wang
Topics: AAML | 30 Jan 2024

Certifying LLM Safety against Adversarial Prompting
Aounon Kumar, Chirag Agarwal, Suraj Srinivas, Aaron Jiaxun Li, Soheil Feizi, Himabindu Lakkaraju
Topics: AAML | 06 Sep 2023