2408.00761
Tamper-Resistant Safeguards for Open-Weight LLMs
International Conference on Learning Representations (ICLR), 2024
1 August 2024
Rishub Tamirisa
Bhrugu Bharathi
Long Phan
Andy Zhou
Alice Gatti
Tarun Suresh
Maxwell Lin
Justin Wang
Rowan Wang
Ron Arel
Andy Zou
Dawn Song
Bo Li
Dan Hendrycks
Mantas Mazeika
AAML
MU
Papers citing
"Tamper-Resistant Safeguards for Open-Weight LLMs"
Showing 50 of 112 citing papers
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Samuele Poppi
Zheng-Xin Yong
Yifei He
Bobbie Chern
Han Zhao
Aobo Yang
Jianfeng Chi
AAML
351
30
0
23 Oct 2024
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
Gintare Karolina Dziugaite
KELM
MU
151
17
0
16 Oct 2024
Meta-Unlearning on Diffusion Models: Preventing Relearning Unlearned Concepts
Hongcheng Gao
Tianyu Pang
Chao Du
Taihang Hu
Zhijie Deng
Min Lin
DiffM
232
17
0
16 Oct 2024
Targeted Vaccine: Safety Alignment for Large Language Models against Harmful Fine-Tuning via Layer-wise Perturbation
IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024
Guozhi Liu
Weiwei Lin
Tiansheng Huang
Ruichao Mo
Qi Mu
Li Shen
AAML
346
28
0
13 Oct 2024
Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb
Fabien Roger
AAML
MU
325
42
0
11 Oct 2024
A Closer Look at Machine Unlearning for Large Language Models
International Conference on Learning Representations (ICLR), 2024
Xiaojian Yuan
Tianyu Pang
Chao Du
Kejiang Chen
Weiming Zhang
Min Lin
MU
608
27
0
10 Oct 2024
OD-Stega: LLM-Based Near-Imperceptible Steganography via Optimized Distributions
Yu-Shin Huang
Peter Just
Krishna Narayanan
Chao Tian
246
15
0
06 Oct 2024
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Pratiksha Thaker
Shengyuan Hu
Neil Kale
Yash Maurya
Zhiwei Steven Wu
Virginia Smith
MU
284
31
0
03 Oct 2024
Harmful Fine-tuning Attacks and Defenses for Large Language Models: A Survey
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
AAML
352
78
0
26 Sep 2024
An Adversarial Perspective on Machine Unlearning for AI Safety
Jakub Łucki
Boyi Wei
Yangsibo Huang
Peter Henderson
F. Tramèr
Javier Rando
MU
AAML
823
80
0
26 Sep 2024
Backtracking Improves Generation Safety
Yiming Zhang
Jianfeng Chi
Hailey Nguyen
Kartikeya Upasani
Daniel M. Bikel
Jason Weston
Eric Michael Smith
SILM
271
24
0
22 Sep 2024
Recent Advances in Attack and Defense Approaches of Large Language Models
Jing Cui
Yishi Xu
Zhewei Huang
Shuchang Zhou
Jianbin Jiao
Junge Zhang
PILM
AAML
295
7
0
05 Sep 2024
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet
Nathaniel Li
Ziwen Han
Ian Steneker
Willow Primack
Riley Goodside
Hugh Zhang
Zifan Wang
Cristina Menghini
Summer Yue
AAML
MU
233
94
0
27 Aug 2024
Antidote: Post-fine-tuning Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang
Gautam Bhattacharya
Pratik Joshi
Josh Kimball
Ling Liu
AAML
MoMe
459
45
0
18 Aug 2024
Safety Alignment Should Be Made More Than Just a Few Tokens Deep
International Conference on Learning Representations (ICLR), 2024
Xiangyu Qi
Ashwinee Panda
Kaifeng Lyu
Xiao Ma
Subhrajit Roy
Ahmad Beirami
Prateek Mittal
Peter Henderson
207
260
0
10 Jun 2024
Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters
Haibo Jin
Andy Zhou
Joe D. Menke
Haohan Wang
168
37
0
30 May 2024
Lazy Safety Alignment for Large Language Models against Harmful Fine-tuning
Tiansheng Huang
Sihao Hu
Fatih Ilhan
Selim Furkan Tekin
Ling Liu
512
37
0
28 May 2024
SOPHON: Non-Fine-Tunable Learning to Restrain Task Transferability For Pre-trained Models
Jiangyi Deng
Shengyuan Pang
Yanjiao Chen
Liangming Xia
Yijie Bai
Haiqin Weng
Wei Dong
AAML
234
15
0
19 Apr 2024
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content
Zhuowen Yuan
Zidi Xiong
Yi Zeng
Ning Yu
Ruoxi Jia
Basel Alomair
Yue Liu
AAML
KELM
213
62
0
19 Mar 2024
Defending Against Unforeseen Failure Modes with Latent Adversarial Training
Stephen Casper
Lennart Schulze
Oam Patel
Dylan Hadfield-Menell
AAML
598
56
0
08 Mar 2024
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
Nathaniel Li
Alexander Pan
Anjali Gopal
Summer Yue
Daniel Berrios
...
Yan Shoshitaishvili
Jimmy Ba
K. Esvelt
Alexandr Wang
Dan Hendrycks
ELM
610
291
0
05 Mar 2024
Eight Methods to Evaluate Robust Unlearning in LLMs
Aengus Lynch
Phillip Guo
Aidan Ewart
Stephen Casper
Dylan Hadfield-Menell
ELM
MU
298
115
0
26 Feb 2024
Immunization against harmful fine-tuning attacks
Domenic Rosati
Jan Wehner
Kai Williams
Lukasz Bartoszcze
Jan Batzner
Hassan Sajjad
Frank Rudzicz
AAML
209
31
0
26 Feb 2024
Rethinking Machine Unlearning for Large Language Models
Sijia Liu
Yuanshun Yao
Jinghan Jia
Stephen Casper
Nathalie Baracaldo
...
Hang Li
Kush R. Varshney
Mohit Bansal
Sanmi Koyejo
Yang Liu
AILaw
MU
346
191
0
13 Feb 2024
HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal
Mantas Mazeika
Long Phan
Xuwang Yin
Andy Zou
Zifan Wang
...
Nathaniel Li
Steven Basart
Bo Li
David A. Forsyth
Dan Hendrycks
AAML
314
685
0
06 Feb 2024
GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of Large Language Models
Haibo Jin
Ruoxi Chen
Peiyan Zhang
Andy Zhou
Yang Zhang
Haohan Wang
LLMAG
367
42
0
05 Feb 2024
Vaccine: Perturbation-aware Alignment for Large Language Model
Tiansheng Huang
Sihao Hu
Ling Liu
374
77
0
02 Feb 2024
Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks
Andy Zhou
Bo Li
Haohan Wang
AAML
353
125
0
30 Jan 2024
Removing RLHF Protections in GPT-4 via Fine-Tuning
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Qiusi Zhan
Richard Fang
R. Bindu
Akul Gupta
Tatsunori Hashimoto
Daniel Kang
MU
AAML
263
138
0
09 Nov 2023
DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Xinwei Wu
Junzhuo Li
Minghui Xu
Weilong Dong
Shuangzhi Wu
Chao Bian
Deyi Xiong
MU
KELM
268
80
0
31 Oct 2023
Large Language Model Unlearning
Yuanshun Yao
Xiaojun Xu
Yang Liu
MU
330
220
0
14 Oct 2023
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!
International Conference on Learning Representations (ICLR), 2023
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
SILM
293
877
0
05 Oct 2023
Shadow Alignment: The Ease of Subverting Safely-Aligned Language Models
Xianjun Yang
Xiao Wang
Tao Gui
Linda R. Petzold
William Y. Wang
Xun Zhao
Dahua Lin
164
242
0
04 Oct 2023
AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models
International Conference on Learning Representations (ICLR), 2023
Xiaogeng Liu
Nan Xu
Muhao Chen
Chaowei Xiao
SILM
257
526
0
03 Oct 2023
Who's Harry Potter? Approximate Unlearning in LLMs
Ronen Eldan
M. Russinovich
MU
MoMe
341
303
0
03 Oct 2023
Universal and Transferable Adversarial Attacks on Aligned Language Models
Andy Zou
Zifan Wang
Nicholas Carlini
Milad Nasr
J. Zico Kolter
Matt Fredrikson
571
2,185
0
27 Jul 2023
Large Language Models
Communications of the ACM (CACM), 2023
Michael R Douglas
LLMAG
LM&MA
519
902
0
11 Jul 2023
Jailbroken: How Does LLM Safety Training Fail?
Neural Information Processing Systems (NeurIPS), 2023
Alexander Wei
Nika Haghtalab
Jacob Steinhardt
581
1,343
0
05 Jul 2023
LEACE: Perfect linear concept erasure in closed form
Neural Information Processing Systems (NeurIPS), 2023
Nora Belrose
David Schneider-Joseph
Shauli Ravfogel
Robert Bamler
Edward Raff
Stella Biderman
KELM
MU
657
163
0
06 Jun 2023
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Neural Information Processing Systems (NeurIPS), 2023
Rafael Rafailov
Archit Sharma
E. Mitchell
Stefano Ermon
Christopher D. Manning
Chelsea Finn
ALM
739
6,364
0
29 May 2023
Self-Destructing Models: Increasing the Costs of Harmful Dual Uses of Foundation Models
AAAI/ACM Conference on AI, Ethics, and Society (AIES), 2022
Peter Henderson
E. Mitchell
Christopher D. Manning
Dan Jurafsky
Chelsea Finn
160
62
0
27 Nov 2022
If Influence Functions are the Answer, Then What is the Question?
Neural Information Processing Systems (NeurIPS), 2022
Juhan Bae
Nathan Ng
Alston Lo
Marzyeh Ghassemi
Roger C. Grosse
TDI
263
136
0
12 Sep 2022
Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Xingyu Xie
Pan Zhou
Huan Li
Zhouchen Lin
Shuicheng Yan
ODL
357
233
0
13 Aug 2022
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai
Andy Jones
Kamal Ndousse
Amanda Askell
Anna Chen
...
Jack Clark
Sam McCandlish
C. Olah
Benjamin Mann
Jared Kaplan
769
3,393
0
12 Apr 2022
Training language models to follow instructions with human feedback
Neural Information Processing Systems (NeurIPS), 2022
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
1.9K
16,931
0
04 Mar 2022
Fast Yet Effective Machine Unlearning
Ayush K Tarun
Vikram S Chundawat
Murari Mandal
Mohan S. Kankanhalli
MU
425
249
0
17 Nov 2021
ZeRO-Offload: Democratizing Billion-Scale Model Training
USENIX Annual Technical Conference (USENIX ATC), 2021
Jie Ren
Samyam Rajbhandari
Reza Yazdani Aminabadi
Olatunji Ruwase
Shuangyang Yang
Minjia Zhang
Dong Li
Yuxiong He
MoE
417
513
0
18 Jan 2021
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
778
2,499
0
31 Dec 2020
The Radicalization Risks of GPT-3 and Advanced Neural Language Models
Kris McGuffie
Alex Newhouse
157
162
0
15 Sep 2020
Measuring Massive Multitask Language Understanding
International Conference on Learning Representations (ICLR), 2020
Dan Hendrycks
Collin Burns
Steven Basart
Andy Zou
Mantas Mazeika
Basel Alomair
Jacob Steinhardt
ELM
RALM
1.3K
6,294
0
07 Sep 2020