ResearchTrend.AI
© 2025 ResearchTrend.AI, All rights reserved.

arXiv:2404.02151 · Cited By
Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
2 April 2024
Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
AAML

Papers citing "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks"

50 / 124 papers shown
Global Challenge for Safe and Secure LLMs Track 1
Xiaojun Jia, Yihao Huang, Yang Liu, Peng Yan Tan, Weng Kuan Yau, ..., Yan Wang, Rick Siow Mong Goh, Liangli Zhen, Yingjie Zhang, Zhe Zhao
ELM, AILaw · 21 Nov 2024

Rethinking the Intermediate Features in Adversarial Attacks: Misleading Robotic Models via Adversarial Distillation
Ke Zhao, Huayang Huang, Miao Li, Yu Wu
AAML · 21 Nov 2024

Diversity Helps Jailbreak Large Language Models
Weiliang Zhao, Daniel Ben-Levi, Wei Hao, Junfeng Yang, Chengzhi Mao
AAML · 06 Nov 2024

Plentiful Jailbreaks with String Compositions
Brian R. Y. Huang
AAML · 01 Nov 2024

Responsible Retrieval Augmented Generation for Climate Decision Making from Documents
Matyas Juhasz, Kalyan Dutia, Henry Franks, Conor Delahunty, Patrick Fawbert Mills, Harrison Pim
31 Oct 2024
Effective and Efficient Adversarial Detection for Vision-Language Models via A Single Vector
Youcheng Huang, Fengbin Zhu, Jingkun Tang, Pan Zhou, Wenqiang Lei, Jiancheng Lv, Tat-Seng Chua
AAML · 30 Oct 2024

Benchmarking LLM Guardrails in Handling Multilingual Toxicity
Yahan Yang, Soham Dan, Dan Roth, Insup Lee
29 Oct 2024

Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring
Honglin Mu, Han He, Yuxin Zhou, Yunlong Feng, Yang Xu, ..., Zeming Liu, Xudong Han, Qi Shi, Qingfu Zhu, Wanxiang Che
AAML · 28 Oct 2024

Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities
Chung-En Sun, Xiaodong Liu, Weiwei Yang, Tsui-Wei Weng, Hao Cheng, Aidan San, Michel Galley, Jianfeng Gao
24 Oct 2024
Dynamic Guided and Domain Applicable Safeguards for Enhanced Security in Large Language Models
He Cao, Weidi Luo, Zijing Liu, Yu Wang, Bing Feng, Yuan Yao, Yu Li
AAML · 23 Oct 2024

Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation
Qizhang Li, Xiaochen Yang, W. Zuo, Yiwen Guo
AAML · 15 Oct 2024

AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation
Zijun Wang, Haoqin Tu, J. Mei, Bingchen Zhao, Y. Wang, Cihang Xie
11 Oct 2024

JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
Fan Liu, Yue Feng, Zhao Xu, Lixin Su, Xinyu Ma, Dawei Yin, Hao Liu
ELM · 11 Oct 2024

Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents
Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, ..., Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang
LLMAG · 11 Oct 2024
Recent advancements in LLM Red-Teaming: Techniques, Defenses, and Ethical Considerations
Tarun Raheja, Nilay Pochhi
AAML · 09 Oct 2024

ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time
Yi Ding, Bolian Li, Ruqi Zhang
MLLM · 09 Oct 2024

Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng, Tianyu Pang, Chao Du, Qian Liu, Jing Jiang, Min-Bin Lin
09 Oct 2024

Instructional Segment Embedding: Improving LLM Safety with Instruction Hierarchy
Tong Wu, Shujian Zhang, Kaiqiang Song, Silei Xu, Sanqiang Zhao, Ravi Agrawal, Sathish Indurthi, Chong Xiang, Prateek Mittal, Wenxuan Zhou
09 Oct 2024

Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks
Zi Wang, Divyam Anshumaan, Ashish Hooda, Yudong Chen, Somesh Jha
AAML · 05 Oct 2024

Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models
Guobin Shen, Dongcheng Zhao, Yiting Dong, Xiang-Yu He, Yi Zeng
AAML · 03 Oct 2024
Automated Red Teaming with GOAT: the Generative Offensive Agent Tester
Maya Pavlova, Erik Brinkman, Krithika Iyer, Vítor Albiero, Joanna Bitton, Hailey Nguyen, J. Li, Cristian Canton Ferrer, Ivan Evtimov, Aaron Grattafiori
ALM · 02 Oct 2024

Endless Jailbreaks with Bijection Learning
Brian R. Y. Huang, Maximilian Li, Leonard Tang
AAML · 02 Oct 2024

HarmAug: Effective Data Augmentation for Knowledge Distillation of Safety Guard Models
Seanie Lee, Haebin Seong, Dong Bok Lee, Minki Kang, Xiaoyin Chen, Dominik Wagner, Yoshua Bengio, Juho Lee, Sung Ju Hwang
02 Oct 2024

Robust LLM safeguarding via refusal feature adversarial training
L. Yu, Virginie Do, Karen Hambardzumyan, Nicola Cancedda
AAML · 30 Sep 2024

RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking
Yifan Jiang, Kriti Aggarwal, Tanmay Laud, Kashif Munir, Jay Pujara, Subhabrata Mukherjee
AAML · 26 Sep 2024
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, F. Tramèr, Javier Rando
MU, AAML · 26 Sep 2024

Backtracking Improves Generation Safety
Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith
SILM · 22 Sep 2024

Extracting Memorized Training Data via Decomposition
Ellen Su, Anu Vellore, Amy Chang, Raffaele Mura, Blaine Nelson, Paul Kassianik, Amin Karbasi
18 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
PILM, AAML · 05 Sep 2024

Automatic Pseudo-Harmful Prompt Generation for Evaluating False Refusals in Large Language Models
Bang An, Sicheng Zhu, Ruiyi Zhang, Michael-Andrei Panaitescu-Liess, Yuancheng Xu, Furong Huang
AAML · 01 Sep 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue
AAML, MU · 27 Aug 2024

Machine Unlearning in Generative AI: A Survey
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng-Long Jiang
MU · 30 Jul 2024

Revisiting the Robust Alignment of Circuit Breakers
Leo Schwinn, Simon Geisler
AAML · 22 Jul 2024

Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan
20 Jul 2024

Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
Nilanjana Das, Edward Raff, Manas Gaur
AAML · 19 Jul 2024

Does Refusal Training in LLMs Generalize to the Past Tense?
Maksym Andriushchenko, Nicolas Flammarion
16 Jul 2024
Systematic Categorization, Construction and Evaluation of New Attacks against Multi-modal Mobile GUI Agents
Yulong Yang, Xinshan Yang, Shuaidong Li, Chenhao Lin, Zhengyu Zhao, Chao Shen, Tianwei Zhang
12 Jul 2024

AI Safety in Generative AI Large Language Models: A Survey
Jaymari Chua, Yun Yvonna Li, Shiyi Yang, Chen Wang, Lina Yao
LM&MA · 06 Jul 2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
AAML · 05 Jul 2024

LoRA-Guard: Parameter-Efficient Guardrail Adaptation for Content Moderation of Large Language Models
Hayder Elesedy, Pedro M. Esperança, Silviu Vlad Oprea, Mete Ozay
KELM · 03 Jul 2024

Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection
Yuqi Zhou, Lin Lu, Hanchi Sun, Pan Zhou, Lichao Sun
28 Jun 2024

From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
Thom Lake, Eunsol Choi, Greg Durrett
25 Jun 2024
Steering Without Side Effects: Improving Post-Deployment Control of Language Models
Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman
LLMSV, AAML · 21 Jun 2024

SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal
Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, ..., Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
ALM, ELM · 20 Jun 2024

Refusal in Language Models Is Mediated by a Single Direction
Andy Arditi, Oscar Obeso, Aaquib Syed, Daniel Paleka, Nina Panickssery, Wes Gurnee, Neel Nanda
17 Jun 2024

Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, Mor Geva
KELM, MU · 17 Jun 2024

Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs
Zhao Xu, Fan Liu, Hao Liu
AAML · 13 Jun 2024
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models
Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang
ELM · 13 Jun 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
Edoardo Debenedetti, Javier Rando, Daniel Paleka, Silaghi Fineas Florin, Dragos Albastroiu, ..., Stefan Kraft, Mario Fritz, Florian Tramèr, Sahar Abdelnabi, Lea Schonherr
12 Jun 2024

Safety Alignment Should Be Made More Than Just a Few Tokens Deep
Xiangyu Qi, Ashwinee Panda, Kaifeng Lyu, Xiao Ma, Subhrajit Roy, Ahmad Beirami, Prateek Mittal, Peter Henderson
10 Jun 2024