MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (arXiv:2311.07689)
13 November 2023
Suyu Ge, Chunting Zhou, Rui Hou, Madian Khabsa, Yi-Chia Wang, Qifan Wang, Jiawei Han, Yuning Mao
Tags: AAML, LRM
Papers citing "MART: Improving LLM Safety with Multi-round Automatic Red-Teaming" (50 of 70 papers shown):
Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors. Ren-Wei Liang, Chin-Ting Hsu, Chan-Hung Yu, Saransh Agrawal, Shih-Cheng Huang, Shang-Tse Chen, Kuan-Hao Huang, Shao-Hua Sun. 27 Apr 2025.
RAG LLMs are Not Safer: A Safety Analysis of Retrieval-Augmented Generation for Large Language Models. Bang An, Shiyue Zhang, Mark Dredze. 25 Apr 2025.
Steering the CensorShip: Uncovering Representation Vectors for LLM "Thought" Control. Hannah Cyberey, David E. Evans. Tags: LLMSV. 23 Apr 2025.
RainbowPlus: Enhancing Adversarial Prompt Generation via Evolutionary Quality-Diversity Search. Quy-Anh Dang, Chris Ngo, Truong Son-Hy. Tags: AAML, SyDa. 21 Apr 2025.
SaRO: Enhancing LLM Safety through Reasoning-based Alignment. Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye. Tags: LLMSV, LRM. 13 Apr 2025.
Geneshift: Impact of different scenario shift on Jailbreaking LLM. Tianyi Wu, Zhiwei Xue, Yue Liu, Jiaheng Zhang, Bryan Hooi, See-Kiong Ng. 10 Apr 2025.
AutoRedTeamer: Autonomous Red Teaming with Lifelong Attack Integration. Andy Zhou, Kevin E. Wu, Francesco Pinto, Z. Chen, Yi Zeng, Yu Yang, Shuang Yang, Sanmi Koyejo, James Zou, Bo Li. Tags: LLMAG, AAML. 20 Mar 2025.
A Framework to Assess Multilingual Vulnerabilities of LLMs. Likai Tang, Niruth Bogahawatta, Yasod Ginige, Jiarui Xu, Shixuan Sun, Surangika Ranathunga, Suranga Seneviratne. 17 Mar 2025.
Cross-Examiner: Evaluating Consistency of Large Language Model-Generated Explanations. Danielle Villa, Maria Chang, K. Murugesan, Rosario A. Uceda-Sosa, K. Ramamurthy. Tags: LRM. 11 Mar 2025.
Dialogue Injection Attack: Jailbreaking LLMs through Context Manipulation. Wenlong Meng, Fan Zhang, Wendao Yao, Zhenyuan Guo, Y. Li, Chengkun Wei, Wenzhi Chen. Tags: AAML. 11 Mar 2025.
Building Safe GenAI Applications: An End-to-End Overview of Red Teaming for Large Language Models. Alberto Purpura, Sahil Wadhwa, Jesse Zymet, Akshay Gupta, Andy Luo, Melissa Kazemi Rad, Swapnil Shinde, Mohammad Sorower. Tags: AAML. 03 Mar 2025.
Foot-In-The-Door: A Multi-turn Jailbreak for LLMs. Zixuan Weng, Xiaolong Jin, Jinyuan Jia, X. Zhang. Tags: AAML. 27 Feb 2025.
Equilibrate RLHF: Towards Balancing Helpfulness-Safety Trade-off in Large Language Models. Yingshui Tan, Yilei Jiang, Y. Li, J. Liu, Xingyuan Bu, Wenbo Su, Xiangyu Yue, Xiaoyong Zhu, Bo Zheng. Tags: ALM. 17 Feb 2025.
Jailbreaking to Jailbreak. Jeremy Kritz, Vaughn Robinson, Robert Vacareanu, Bijan Varjavand, Michael Choi, Bobby Gogov, Scale Red Team, Summer Yue, Willow Primack, Zifan Wang. 09 Feb 2025.
ChatBug: A Common Vulnerability of Aligned LLMs Induced by Chat Templates. Fengqing Jiang, Zhangchen Xu, Luyao Niu, Bill Yuchen Lin, Radha Poovendran. Tags: SILM. 08 Jan 2025.
Diverse and Effective Red Teaming with Auto-generated Rewards and Multi-step Reinforcement Learning. Alex Beutel, Kai Y. Xiao, Johannes Heidecke, Lilian Weng. Tags: AAML. 24 Dec 2024.
Private Yet Social: How LLM Chatbots Support and Challenge Eating Disorder Recovery. Ryuhaerang Choi, Taehan Kim, Subin Park, Jennifer G Kim, Sung-Ju Lee. Tags: AI4MH. 16 Dec 2024.
Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness. Avinash Amballa, Durga Sandeep Saluru, Gayathri Akkinapalli, Abhishek Sureddy, Akshay Kumar Sureddy. Tags: ALM. 26 Nov 2024.
In-Context Experience Replay Facilitates Safety Red-Teaming of Text-to-Image Diffusion Models. Zhi-Yi Chin, Kuan-Chen Mu, Mario Fritz, Pin-Yu Chen. Tags: DiffM. 25 Nov 2024.
New Emerged Security and Privacy of Pre-trained Model: a Survey and Outlook. Meng Yang, Tianqing Zhu, Chi Liu, Wanlei Zhou, Shui Yu, Philip S. Yu. Tags: AAML, ELM, PILM. 12 Nov 2024.
Replace-then-Perturb: Targeted Adversarial Attacks With Visual Reasoning for Vision-Language Models. Jonggyu Jang, Hyeonsu Lyu, Jungyeon Koh, H. Yang. Tags: VLM, AAML. 01 Nov 2024.
AdvWeb: Controllable Black-box Attacks on VLM-powered Web Agents. Chejian Xu, Mintong Kang, Jiawei Zhang, Zeyi Liao, Lingbo Mo, Mengqi Yuan, Huan Sun, Bo Li. Tags: AAML. 22 Oct 2024.
To Err is AI: A Case Study Informing LLM Flaw Reporting Practices. Sean McGregor, Allyson Ettinger, Nick Judd, Paul Albee, Liwei Jiang, ..., Avijit Ghosh, Christopher Fiorelli, Michelle Hoang, Sven Cattell, Nouha Dziri. 15 Oct 2024.
Refusal-Trained LLMs Are Easily Jailbroken As Browser Agents. Priyanshu Kumar, Elaine Lau, Saranya Vijayakumar, Tu Trinh, Scale Red Team, ..., Sean Hendryx, Shuyan Zhou, Matt Fredrikson, Summer Yue, Zifan Wang. Tags: LLMAG. 11 Oct 2024.
Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models. Fei Wang, Ninareh Mehrabi, Palash Goyal, Rahul Gupta, Kai-Wei Chang, Aram Galstyan. Tags: ALM. 07 Oct 2024.
ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs. Lu Yan, Siyuan Cheng, Xuan Chen, Kaiyuan Zhang, Guangyu Shen, Zhuo Zhang, Xiangyu Zhang. Tags: AAML, SILM. 05 Oct 2024.
Examining the Role of Relationship Alignment in Large Language Models. Kristen M. Altenburger, Hongda Jiang, Robert E. Kraut, Yi-Chia Wang, Jane Dwivedi-Yu. 02 Oct 2024.
FlipAttack: Jailbreak LLMs via Flipping. Yue Liu, Xiaoxin He, Miao Xiong, Jinlan Fu, Shumin Deng, Bryan Hooi. Tags: AAML. 02 Oct 2024.
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction. Jinchuan Zhang, Yan Zhou, Yaxin Liu, Ziming Li, Songlin Hu. Tags: AAML. 25 Sep 2024.
Alignment with Preference Optimization Is All You Need for LLM Safety. Réda Alami, Ali Khalifa Almansoori, Ahmed Alzubaidi, M. Seddik, Mugariya Farooq, Hakim Hacid. 12 Sep 2024.
Doppelgänger's Watch: A Split Objective Approach to Large Language Models. S. Ghasemlou, Ashish Katiyar, Aparajita Saraf, Seungwhan Moon, Mangesh Pujari, Pinar E. Donmez, Babak Damavandi, Anuj Kumar. 09 Sep 2024.
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet. Nathaniel Li, Ziwen Han, Ian Steneker, Willow Primack, Riley Goodside, Hugh Zhang, Zifan Wang, Cristina Menghini, Summer Yue. Tags: AAML, MU. 27 Aug 2024.
Ferret: Faster and Effective Automated Red Teaming with Reward-Based Scoring Technique. Tej Deep Pala, Vernon Y.H. Toh, Rishabh Bhardwaj, Soujanya Poria. Tags: AAML. 20 Aug 2024.
Can Reinforcement Learning Unlock the Hidden Dangers in Aligned Large Language Models? Mohammad Bahrami Karkevandi, Nishant Vishwamitra, Peyman Najafirad. Tags: AAML. 05 Aug 2024.
SEAS: Self-Evolving Adversarial Safety Optimization for Large Language Models. Muxi Diao, Rumei Li, Shiyang Liu, Guogang Liao, Jingang Wang, Xunliang Cai, Weiran Xu. Tags: AAML. 05 Aug 2024.
Can LLMs be Fooled? Investigating Vulnerabilities in LLMs. Sara Abdali, Jia He, C. Barberan, Richard Anarfi. 30 Jul 2024.
Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs). Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, Nhathai Phan. 20 Jul 2024.
Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture. Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma. 10 Jul 2024.
Jailbreak Attacks and Defenses Against Large Language Models: A Survey. Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li. Tags: AAML. 05 Jul 2024.
The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm. Aakanksha, Arash Ahmadian, B. Ermiş, Seraphina Goldfarb-Tarrant, Julia Kreutzer, Marzieh Fadaee, Sara Hooker. 26 Jun 2024.
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models. Haibo Jin, Leyang Hu, Xinuo Li, Peiyan Zhang, Chonghan Chen, Jun Zhuang, Haohan Wang. Tags: PILM. 26 Jun 2024.
Toward Optimal LLM Alignments Using Two-Player Games. Rui Zheng, Hongyi Guo, Zhihan Liu, Xiaoying Zhang, Yuanshun Yao, ..., Tao Gui, Qi Zhang, Xuanjing Huang, Hang Li, Yang Liu. 16 Jun 2024.
JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models. Delong Ran, Jinyuan Liu, Yichen Gong, Jingyi Zheng, Xinlei He, Tianshuo Cong, Anyu Wang. Tags: ELM. 13 Jun 2024.
AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens. Lin Lu, Hai Yan, Zenghui Yuan, Jiawen Shi, Wenqi Wei, Pin-Yu Chen, Pan Zhou. Tags: AAML. 06 Jun 2024.
Safeguarding Large Language Models: A Survey. Yi Dong, Ronghui Mu, Yanghao Zhang, Siqi Sun, Tianle Zhang, ..., Yi Qi, Jinwei Hu, Jie Meng, Saddek Bensalem, Xiaowei Huang. Tags: OffRL, KELM, AILaw. 03 Jun 2024.
OR-Bench: An Over-Refusal Benchmark for Large Language Models. Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh. Tags: ALM. 31 May 2024.
ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users. Guanlin Li, Kangjie Chen, Shudong Zhang, Jie M. Zhang, Tianwei Zhang. Tags: EGVM. 24 May 2024.
ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation. Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua. Tags: LLMAG. 23 May 2024.
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models. Raghuveer Peri, Sai Muralidhar Jayanthi, S. Ronanki, Anshu Bhatia, Karel Mundnich, ..., Srikanth Vishnubhotla, Daniel Garcia-Romero, S. Srinivasan, Kyu J. Han, Katrin Kirchhoff. Tags: AAML. 14 May 2024.
Mitigating Exaggerated Safety in Large Language Models. Ruchi Bhalani, Ruchira Ray. 08 May 2024.