ResearchTrend.AI
AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks

2 March 2024
Yifan Zeng
Yiran Wu
Xiao Zhang
Huazheng Wang
Qingyun Wu
LLMAG, AAML
arXiv: 2403.04783 (abs, PDF, HTML) · HuggingFace (2 upvotes) · GitHub (46★)

Papers citing "AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks"

46 / 46 papers shown
AlignTree: Efficient Defense Against LLM Jailbreak Attacks
Gil Goren
Shahar Katz
Lior Wolf
AAML
139
0
0
15 Nov 2025
From Evidence to Verdict: An Agent-Based Forensic Framework for AI-Generated Image Detection
Mengfei Liang
Y. Qu
Yukun Jiang
Michael Backes
Yang Zhang
116
0
0
31 Oct 2025
Multi-Agent Evolve: LLM Self-Improve through Co-evolution
Yixing Chen
Yiding Wang
Siqi Zhu
Haofei Yu
Tao Feng
Muhan Zhang
M. Patwary
Jiaxuan You
LLMAG, LRM
251
4
0
27 Oct 2025
Sentra-Guard: A Multilingual Human-AI Framework for Real-Time Defense Against Adversarial LLM Jailbreaks
Md. Mehedi Hasan
Ziaur Rahman
Rafid Mostafiz
Md. Abir Hossain
AAML
88
0
0
26 Oct 2025
SoK: Taxonomy and Evaluation of Prompt Security in Large Language Models
Hanbin Hong
Shuya Feng
Nima Naderloui
Shenao Yan
Jingyu Zhang
Biying Liu
Ali Arastehfard
Heqing Huang
Yuan Hong
AAML
181
0
0
17 Oct 2025
Active Honeypot Guardrail System: Probing and Confirming Multi-Turn LLM Jailbreaks
ChenYu Wu
Yi Wang
Yang Liao
84
0
0
16 Oct 2025
The Alignment Waltz: Jointly Training Agents to Collaborate for Safety
Jingyu Zhang
Haozhu Wang
Eric Michael Smith
Sid Wang
Amr Sharaf
Mahesh Pasupuleti
Benjamin Van Durme
Daniel Khashabi
Jason Weston
Hongyuan Zhan
72
0
0
09 Oct 2025
Proactive defense against LLM Jailbreak
Weiliang Zhao
Jinjun Peng
Daniel Ben-Levi
Zhou Yu
Junfeng Yang
AAML
119
1
0
06 Oct 2025
Machine Learning for Detection and Analysis of Novel LLM Jailbreaks
J. Hawkins
Aditya Pramar
R. Beard
Rohitash Chandra
125
0
0
02 Oct 2025
Agentic AutoSurvey: Let LLMs Survey LLMs
Yixin Liu
Yonghui Wu
Denghui Zhang
Lichao Sun
AI4CE
96
1
0
23 Sep 2025
LLM Jailbreak Detection for (Almost) Free!
Guorui Chen
Yifan Xia
Xiaojun Jia
Ruoyao Xiao
Juil Sock
Jindong Gu
70
0
0
18 Sep 2025
Aegis: Automated Error Generation and Attribution for Multi-Agent Systems
Fanqi Kong
Ruijie Zhang
Huaxiao Yin
Guibin Zhang
X. Zhang
Ziang Chen
Zhaowei Zhang
Xiaoyuan Zhang
Song-Chun Zhu
Xue Feng
AAML
256
0
0
17 Sep 2025
Evaluating the Robustness of Retrieval-Augmented Generation to Adversarial Evidence in the Health Domain
Shakiba Amirshahi
Amin Bigdeli
Charles L. A. Clarke
Amira Ghenai
AAML
64
1
0
04 Sep 2025
Retrieval-Augmented Defense: Adaptive and Controllable Jailbreak Prevention for Large Language Models
Guangyu Yang
Jinghong Chen
Jingbiao Mei
Weizhe Lin
Bill Byrne
AAML
88
0
0
22 Aug 2025
A Real-Time, Self-Tuning Moderator Framework for Adversarial Prompt Detection
Ivan Zhang
AAML
58
0
0
10 Aug 2025
SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation
Lai Jiang
Yuekang Li
Xiaohan Zhang
Youtao Ding
Li Pan
65
0
0
08 Aug 2025
BlockA2A: Towards Secure and Verifiable Agent-to-Agent Interoperability
Zhenhua Zou
Zhuotao Liu
Lepeng Zhao
Qiuyang Zhan
LLMAG
192
2
0
02 Aug 2025
ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation
Yiran Wu
Mauricio Velazco
Andrew Zhao
Manuel Raúl Meléndez Luján
Srisuma Movva
...
Roberto Rodriguez
Qingyun Wu
Michael Albada
Julia Kiseleva
Anand Mudgerikar
LLMAG, ELM
214
4
0
14 Jul 2025
Improving Large Language Model Safety with Contrastive Representation Learning
Samuel Simko
Mrinmaya Sachan
Bernhard Schölkopf
Zhijing Jin
AAML
165
0
0
13 Jun 2025
Step-by-step Instructions and a Simple Tabular Output Format Improve the Dependency Parsing Accuracy of LLMs
Hiroshi Matsuda
Chunpeng Ma
Masayuki Asahara
212
3
0
11 Jun 2025
Demonstrations of Integrity Attacks in Multi-Agent Systems
Can Zheng
Yuhan Cao
Xiaoning Dong
Tianxing He
LLMAG, AAML
170
3
0
05 Jun 2025
Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness
Yongjin Yang
Euiin Yi
Jongwoo Ko
Kimin Lee
Zhijing Jin
Se-Young Yun
LLMAG
195
9
0
29 May 2025
Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers
Viet-Anh Nguyen
Shiqian Zhao
Gia Dao
Runyi Hu
Yi Xie
Luu Anh Tuan
AAML, LRM
276
6
0
22 May 2025
Silent Leaks: Implicit Knowledge Extraction Attack on RAG Systems through Benign Queries
Yuhao Wang
Wenjie Qu
Shengfang Zhai
Zichen Liu
Yinpeng Dong
Jiaheng Zhang
SILM
220
3
0
21 May 2025
PeerGuard: Defending Multi-Agent Systems Against Backdoor Attacks Through Mutual Reasoning
IEEE International Conference on Information Reuse and Integration (IRI), 2025
Falong Fan
Xi Li
LLMAG, AAML
233
6
0
16 May 2025
A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?
Ada Chen
Yongjiang Wu
Jing Zhang
Shu Yang
Jen-tse Huang
Wenxuan Wang
S. Wang
ELM
350
10
0
16 May 2025
EmoAgent: Assessing and Safeguarding Human-AI Interaction for Mental Health Safety
Jiahao Qiu
Yinghui He
Xinzhe Juan
Yun Wang
Wenshu Fan
Zixin Yao
Yue Wu
Xun Jiang
L. Yang
Mengdi Wang
AI4MH
453
12
0
13 Apr 2025
Agents Under Siege: Breaking Pragmatic Multi-Agent LLM Systems with Optimized Prompt Attacks
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Rana Muhammad Shahroz Khan
Zhen Tan
Sukwon Yun
Charles Flemming
Tianlong Chen
AAML, LLMAG
469
9
0
31 Mar 2025
Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation
Wenlong Meng
Fan Zhang
Wendao Yao
Zhenyuan Guo
Yongqian Li
Chengkun Wei
Wenzhi Chen
AAML
230
8
0
11 Mar 2025
This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs
Lorenz Wolf
Sangwoong Yoon
Ilija Bogunovic
187
0
0
07 Mar 2025
Beyond Surface-Level Patterns: An Essence-Driven Defense Framework Against Jailbreak Attacks in LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shiyu Xiang
Ansen Zhang
Yanfei Cao
Yang Fan
Ronghao Chen
AAML
272
8
0
26 Feb 2025
When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search
Neural Information Processing Systems (NeurIPS), 2024
Xuan Chen
Yuzhou Nie
Wenbo Guo
Xiangyu Zhang
328
36
0
28 Jan 2025
PrisonBreak: Jailbreaking Large Language Models with at Most Twenty-Five Targeted Bit-flips
Zachary Coalson
Jeonghyun Woo
Shiyang Chen
Yu Sun
...
Lishan Yang
Gururaj Saileshwar
Prashant J. Nair
Bo Fang
Sanghyun Hong
AAML
419
0
0
10 Dec 2024
Diversity Helps Jailbreak Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Weiliang Zhao
Daniel Ben-Levi
Wei Hao
Junfeng Yang
Chengzhi Mao
AAML
931
3
0
06 Nov 2024
BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks
International Conference on Learning Representations (ICLR), 2024
Yunhan Zhao
Xiang Zheng
Lin Luo
Yige Li
Xingjun Ma
Yu-Gang Jiang
VLM, AAML
244
15
0
28 Oct 2024
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence
Shangbin Feng
Zifeng Wang
Yike Wang
Sayna Ebrahimi
Hamid Palangi
...
Nathalie Rauschmayr
Yejin Choi
Yulia Tsvetkov
Tomas Pfister
MoMe
242
16
0
15 Oct 2024
Bridging Today and the Future of Humanity: AI Safety in 2024 and Beyond
Shanshan Han
471
1
0
09 Oct 2024
VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data
Xuefeng Du
Reshmi Ghosh
Robert Sim
Ahmed Salem
Vitor Carvalho
Emily Lawton
Yixuan Li
Jack W. Stokes
VLM, AAML
183
14
0
01 Oct 2024
A Voter-Based Stochastic Rejection-Method Framework for Asymptotically Safe Language Model Outputs
Jake R. Watts
Joel Sokol
126
0
0
24 Jul 2024
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma
Satyapriya Krishna
Sebastian Gehrmann
Madhavan Seshadri
Anu Pradhan
Tom Ault
Leslie Barrett
David Rabinowitz
John Doucette
Nhathai Phan
328
36
0
20 Jul 2024
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models
Haibo Jin
Leyang Hu
Xinuo Li
Peiyan Zhang
Chonghan Chen
Jun Zhuang
Haohan Wang
PILM
319
56
0
26 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran
Jinyuan Liu
Yichen Gong
Jingyi Zheng
Xinlei He
Tianshuo Cong
Anyu Wang
ELM
349
21
0
13 Jun 2024
A Survey of Language-Based Communication in Robotics
William Hunt
Sarvapali D. Ramchurn
Mohammad D. Soorati
LM&Ro
608
15
0
06 Jun 2024
AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways
Zehang Deng
Yongjian Guo
Changzhou Han
Wanlun Ma
Junwu Xiong
Sheng Wen
Yang Xiang
341
113
0
04 Jun 2024
Guardrail Baselines for Unlearning in LLMs
Pratiksha Thaker
Yash Maurya
Shengyuan Hu
Zhiwei Steven Wu
Virginia Smith
MU
296
75
0
05 Mar 2024
Demystifying RCE Vulnerabilities in LLM-Integrated Apps
Conference on Computer and Communications Security (CCS), 2023
Tong Liu
Zizhuang Deng
Guozhu Meng
Yuekang Li
Kai Chen
SILM
468
47
0
06 Sep 2023