Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, Peter Henderson
arXiv:2310.03693 · SILM · 5 October 2023
Papers citing "Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!"
50 / 395 papers shown
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang, Sihao Hu, Fatih Ilhan, Selim Furkan Tekin, Ling Liu
AAML · 26 Sep 2024

A Survey on Offensive AI Within Cybersecurity
Sahil Girhepuje, Aviral Verma, Gaurav Raina
AAML · 26 Sep 2024

MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks
Giandomenico Cornacchia, Giulio Zizzo, Kieran Fraser, Muhammad Zaid Hameed, Ambrish Rawat, Mark Purcell
26 Sep 2024

Elephant in the Room: Unveiling the Impact of Reward Model Quality in Alignment
Yan Liu, Xiaoyuan Yi, Xiaokang Chen, Jing Yao, Jingwei Yi, Daoguang Zan, Zheng Liu, Xing Xie, Tsung-Yi Ho
ALM · 26 Sep 2024

An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, F. Tramèr, Javier Rando
MU, AAML · 26 Sep 2024

LLM Echo Chamber: personalized and automated disinformation
Tony Ma
24 Sep 2024
Backtracking Improves Generation Safety
Yiming Zhang, Jianfeng Chi, Hailey Nguyen, Kartikeya Upasani, Daniel M. Bikel, Jason Weston, Eric Michael Smith
SILM · 22 Sep 2024

PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach
Zhihao Lin, Wei Ma, Mingyi Zhou, Yanjie Zhao, Haoyu Wang, Yang Liu, Jun Wang, Li Li
AAML · 21 Sep 2024

Towards LifeSpan Cognitive Systems
Yu Wang, Chi Han, Tongtong Wu, Xiaoxin He, Wangchunshu Zhou, ..., Zexue He, Wei Wang, Gholamreza Haffari, Heng Ji, Julian McAuley
KELM, CLL · 20 Sep 2024

Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning
Essa Jan, Nouar Aldahoul, Moiz Ali, Faizan Ahmad, Fareed Zaffar, Yasir Zaki
18 Sep 2024
Prompt Obfuscation for Large Language Models
David Pape, Thorsten Eisenhofer, Lea Schönherr
AAML · 17 Sep 2024

Householder Pseudo-Rotation: A Novel Approach to Activation Editing in LLMs with Direction-Magnitude Perspective
Van-Cuong Pham, Thien Huu Nguyen
LLMSV · 16 Sep 2024

ValueCompass: A Framework for Measuring Contextual Value Alignment Between Human and LLMs
Hua Shen, Tiffany Knearem, Reshmi Ghosh, Yu-Ju Yang, Tanushree Mitra, Yun Huang
15 Sep 2024

Improving governance outcomes through AI documentation: Bridging theory and practice
Amy A. Winecoff, Miranda Bogen
13 Sep 2024

Alignment of Diffusion Models: Fundamentals, Challenges, and Future
Buhua Liu, Shitong Shao, Bao Li, Lichen Bai, Zhiqiang Xu, Haoyi Xiong, James Kwok, Sumi Helal, Zeke Xie
11 Sep 2024

DiPT: Enhancing LLM reasoning through diversified perspective-taking
H. Just, Mahavir Dabas, Lifu Huang, Ming Jin, Ruoxi Jia
LRM · 10 Sep 2024
Exploring Straightforward Conversational Red-Teaming
George Kour, Naama Zwerdling, Marcel Zalmanovici, Ateret Anaby-Tavor, Ora Nova Fandina, E. Farchi
AAML · 7 Sep 2024

Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui, Yishi Xu, Zhewei Huang, Shuchang Zhou, Jianbin Jiao, Junge Zhang
PILM, AAML · 5 Sep 2024

The Dark Side of Human Feedback: Poisoning Large Language Models via User Inputs
Bocheng Chen, Hanqing Guo, Guangjing Wang, Yuanda Wang, Qiben Yan
AAML · 1 Sep 2024

Acceptable Use Policies for Foundation Models
Kevin Klyman
29 Aug 2024

FRACTURED-SORRY-Bench: Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench
Aman Priyanshu, Supriti Vijay
AAML · 28 Aug 2024

Legilimens: Practical and Unified Content Moderation for Large Language Model Services
Jialin Wu, Jiangyi Deng, Shengyuan Pang, Yanjiao Chen, Jiayang Xu, Xinfeng Li, Wenyuan Xu
28 Aug 2024
Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models
Wenxuan Zhang, Philip H. S. Torr, Mohamed Elhoseiny, Adel Bibi
27 Aug 2024

Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation
Haoyu Wang, Bingzhe Wu, Yatao Bian, Yongzhe Chang, Xueqian Wang, Peilin Zhao
20 Aug 2024

Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang, Gautam Bhattacharya, Pratik Joshi, Josh Kimball, Ling Liu
AAML, MoMe · 18 Aug 2024

How Susceptible are LLMs to Influence in Prompts?
Sotiris Anagnostidis, Jannis Bulian
LRM · 17 Aug 2024

MIA-Tuner: Adapting Large Language Models as Pre-training Text Detector
Wenjie Fu, Huandong Wang, Chen Gao, Guanghua Liu, Yong Li, Tao Jiang
16 Aug 2024

Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks
Jiawei Zhao, Kejiang Chen, Xiaojian Yuan, Weiming Zhang
AAML · 15 Aug 2024

Multi-Turn Context Jailbreak Attack on Large Language Models From First Principles
Xiongtao Sun, Deyue Zhang, Dongdong Yang, Quanchen Zou, Hui Li
AAML · 8 Aug 2024
WalledEval: A Comprehensive Safety Evaluation Toolkit for Large Language Models
Prannaya Gupta, Le Qi Yau, Hao Han Low, I-Shiang Lee, Hugo Maximus Lim, ..., Jia Hng Koh, Dar Win Liew, Rishabh Bhardwaj, Rajat Bhardwaj, Soujanya Poria
ELM, LM&MA · 7 Aug 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
Jingtong Su, Mingyu Lee, SangKeun Lee
2 Aug 2024

Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
AAML, MU · 1 Aug 2024

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Richard Ren, Steven Basart, Adam Khoja, Alice Gatti, Long Phan, ..., Alexander Pan, Gabriel Mukobi, Ryan H. Kim, Stephen Fitz, Dan Hendrycks
ELM · 31 Jul 2024

Machine Unlearning in Generative AI: A Survey
Zheyuan Liu, Guangyao Dou, Zhaoxuan Tan, Yijun Tian, Meng-Long Jiang
MU · 30 Jul 2024

Can Editing LLMs Inject Harm?
Canyu Chen, Baixiang Huang, Zekun Li, Zhaorun Chen, Shiyang Lai, ..., Xifeng Yan, William Wang, Philip H. S. Torr, Dawn Song, Kai Shu
KELM · 29 Jul 2024
The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models
Zihui Wu, Haichang Gao, Jianping He, Ping Wang
25 Jul 2024

Know Your Limits: A Survey of Abstention in Large Language Models
Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, Lucy Lu Wang
25 Jul 2024

Can Large Language Models Automatically Jailbreak GPT-4V?
Yuanwei Wu, Yue Huang, Yixin Liu, Xiang Li, Pan Zhou, Lichao Sun
SILM · 23 Jul 2024

RedAgent: Red Teaming Large Language Models with Context-aware Autonomous Language Agent
Huiyu Xu, Wenhui Zhang, Zhibo Wang, Feng Xiao, Rui Zheng, Yunhe Feng, Zhongjie Ba, Kui Ren
AAML, LLMAG · 23 Jul 2024

AI Act for the Working Programmer
Holger Hermanns, Anne Lauber-Rönsberg, Philip Meinel, Sarah Sterz, Hanwei Zhang
23 Jul 2024

LLMmap: Fingerprinting For Large Language Models
Dario Pasquini, Evgenios M. Kornaropoulos, G. Ateniese
22 Jul 2024
Human-Interpretable Adversarial Prompt Attack on Large Language Models with Situational Context
Nilanjana Das, Edward Raff, Manas Gaur
AAML · 19 Jul 2024

The Better Angels of Machine Personality: How Personality Relates to LLM Safety
Jie M. Zhang, Dongrui Liu, Chao Qian, Ziyue Gan, Yong-jin Liu, Yu Qiao, Jing Shao
LLMAG, PILM · 17 Jul 2024

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training
Youliang Yuan, Wenxiang Jiao, Wenxuan Wang, Jen-tse Huang, Jiahao Xu, Tian Liang, Pinjia He, Zhaopeng Tu
12 Jul 2024

Multilingual Blending: LLM Safety Alignment Evaluation with Language Mixture
Jiayang Song, Yuheng Huang, Zhehua Zhou, Lei Ma
10 Jul 2024

Grounding and Evaluation for Large Language Models: Practical Challenges and Lessons Learned (Survey)
K. Kenthapadi, M. Sameki, Ankur Taly
HILM, ELM, AILaw · 10 Jul 2024

ChatGPT Doesn't Trust Chargers Fans: Guardrail Sensitivity in Context
Victoria R. Li, Yida Chen, Naomi Saphra
9 Jul 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
Jinseok Kim, Jaewon Jung, Sangyeop Kim, S. Park, Sungzoon Cho
9 Jul 2024

Jailbreak Attacks and Defenses Against Large Language Models: A Survey
Sibo Yi, Yule Liu, Zhen Sun, Tianshuo Cong, Xinlei He, Jiaxing Song, Ke Xu, Qi Li
AAML · 5 Jul 2024

Securing Multi-turn Conversational Language Models Against Distributed Backdoor Triggers
Terry Tong, Jiashu Xu, Qin Liu, Muhao Chen
AAML, SILM · 4 Jul 2024