arXiv: 2403.03218
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
5 March 2024
Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Long Phan, Gabriel Mukobi, Nathan Helm-Burger, Rassin R. Lababidi, Lennart Justen, Andrew B. Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, Bhrugu Bharathi, Adam Khoja, Zhenqi Zhao, Ariel Herbert-Voss, Cort B. Breuer, Samuel Marks, Oam Patel, Andy Zou, Mantas Mazeika, Zifan Wang, Palash Oswal, Weiran Liu, Adam A. Hunt, Justin Tienken-Harder, Kevin Y. Shih, Kemper Talley, John Guan, Russell Kaplan, Ian Steneker, David Campbell, Brad Jokubaitis, Alex Levinson, Jean Wang, William Qian, K. Karmakar, Steven Basart, Stephen Fitz, Mindy Levine, Ponnurangam Kumaraguru, U. Tupakula, Vijay Varadharajan, Ruoyu Wang, Yan Shoshitaishvili, Jimmy Ba, K. Esvelt, Alexandr Wang, Dan Hendrycks
ELM
Papers citing "The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning"
35 / 35 papers shown

Rethinking Memory in AI: Taxonomy, Operations, Topics, and Future Directions
Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sébastien Montella, Mirella Lapata, Kam-Fai Wong, Jeff Z. Pan
KELM, MU · 62 / 1 / 0 · 01 May 2025

Unlearning Sensitive Information in Multimodal LLMs: Benchmark and Attack-Defense Evaluation
Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, Mohit Bansal
AAML, MU · 77 / 3 / 0 · 01 May 2025

Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting, Pedro Medeiros, Jon G Sanders, Nathaniel Li, Long Phan, Karam Elabd, Lennart Justen, Dan Hendrycks, Seth Donoughe
ELM · 45 / 2 / 0 · 21 Apr 2025

Not All Data Are Unlearned Equally
Aravind Krishnan, Siva Reddy, Marius Mosbach
MU · 34 / 0 / 0 · 07 Apr 2025

UIPE: Enhancing LLM Unlearning by Removing Knowledge Related to Forgetting Targets
Wenyu Wang, M. Zhang, Xiaotian Ye, Z. Z. Ren, Z. Chen, Pengjie Ren
MU, KELM · 62 / 0 / 0 · 06 Mar 2025

LED-Merging: Mitigating Safety-Utility Conflicts in Model Merging with Location-Election-Disjoint
Qianli Ma, Dongrui Liu, Qian Chen, Linfeng Zhang, Jing Shao
MoMe · 47 / 0 / 0 · 24 Feb 2025

ReLearn: Unlearning via Learning for Large Language Models
Haoming Xu, Ningyuan Zhao, Liming Yang, Sendong Zhao, Shumin Deng, Mengru Wang, Bryan Hooi, Nay Oo, H. Chen, N. Zhang
KELM, CLL, MU · 46 / 0 / 0 · 16 Feb 2025

SEER: Self-Explainability Enhancement of Large Language Models' Representations
Guanxu Chen, Dongrui Liu, Tao Luo, Jing Shao
LRM, MILM · 51 / 1 / 0 · 07 Feb 2025

Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che, Stephen Casper, Robert Kirk, Anirudh Satheesh, Stewart Slocum, ..., Zikui Cai, Bilal Chughtai, Y. Gal, Furong Huang, Dylan Hadfield-Menell
MU, AAML, ELM · 60 / 2 / 0 · 03 Feb 2025

Episodic memory in AI agents poses risks that should be studied and mitigated
Chad DeChant
43 / 1 / 0 · 20 Jan 2025

Benchmarking Vision Language Model Unlearning via Fictitious Facial Identity Dataset
Yingzi Ma, Jiongxiao Wang, Fei-Yue Wang, Siyuan Ma, Jiazhao Li, ..., B. Li, Yejin Choi, M. Chen, Chaowei Xiao
MU · 37 / 6 / 0 · 05 Nov 2024

RESTOR: Knowledge Recovery through Machine Unlearning
Keivan Rezaei, Khyathi Raghavi Chandu, S. Feizi, Yejin Choi, Faeze Brahman, Abhilasha Ravichander
KELM, CLL, MU · 47 / 0 / 0 · 31 Oct 2024

WAGLE: Strategic Weight Attribution for Effective and Modular Unlearning in Large Language Models
Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, Sijia Liu
MU · 30 / 2 / 0 · 23 Oct 2024

LabSafety Bench: Benchmarking LLMs on Safety Issues in Scientific Labs
Yujun Zhou, Jingdong Yang, Kehan Guo, Pin-Yu Chen, Tian Gao, ..., Werner Geyer, Nuno Moniz, Nitesh V Chawla, Xiangliang Zhang
23 / 4 / 0 · 18 Oct 2024

Do Unlearning Methods Remove Information from Language Model Weights?
Aghyad Deeb, Fabien Roger
AAML, MU · 32 / 11 / 0 · 11 Oct 2024

Simplicity Prevails: Rethinking Negative Preference Optimization for LLM Unlearning
Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, Sijia Liu
MU · 29 / 15 / 0 · 09 Oct 2024

A Probabilistic Perspective on Unlearning and Alignment for Large Language Models
Yan Scholten, Stephan Günnemann, Leo Schwinn
MU · 38 / 6 / 0 · 04 Oct 2024

Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith
MU · 28 / 10 / 0 · 03 Oct 2024

Erasing Conceptual Knowledge from Language Models
Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau
KELM, ELM, MU · 32 / 5 / 0 · 03 Oct 2024

Tamper-Resistant Safeguards for Open-Weight LLMs
Rishub Tamirisa, Bhrugu Bharathi, Long Phan, Andy Zhou, Alice Gatti, ..., Andy Zou, Dawn Song, Bo Li, Dan Hendrycks, Mantas Mazeika
AAML, MU · 27 / 36 / 0 · 01 Aug 2024

Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models
Haoyu Tang, Ye Liu, Xukai Liu, Yanghai Zhang, Kai Zhang, Xiaofang Zhou, Enhong Chen
MU · 35 / 3 / 0 · 25 Jul 2024

Composable Interventions for Language Models
Arinbjorn Kolbeinsson, Kyle O'Brien, Tianjin Huang, Shanghua Gao, Shiwei Liu, ..., Anurag J. Vaidya, Faisal Mahmood, Marinka Zitnik, Tianlong Chen, Thomas Hartvigsen
KELM, MU · 57 / 5 / 0 · 09 Jul 2024

PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
Huixuan Zhang, Yun Lin, Xiaojun Wan
26 / 0 / 0 · 26 Jun 2024

AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij, Felix Hofstätter, Ollie Jaffe, Samuel F. Brown, Francis Rhys Ward
ELM · 22 / 22 / 0 · 11 Jun 2024

Soft Prompt Threats: Attacking Safety Alignment and Unlearning in Open-Source LLMs through the Embedding Space
Leo Schwinn, David Dobre, Sophie Xhonneux, Gauthier Gidel, Stephan Gunnemann
AAML · 23 / 36 / 0 · 14 Feb 2024

Feedback Loops With Language Models Drive In-Context Reward Hacking
Alexander Pan, Erik Jones, Meena Jagadeesan, Jacob Steinhardt
KELM · 34 / 25 / 0 · 09 Feb 2024

PaperQA: Retrieval-Augmented Generative Agent for Scientific Research
Jakub Lála, Odhran O'Donoghue, Aleksandar Shtedritski, Sam Cox, Samuel G. Rodriques, Andrew D. White
RALM · 64 / 66 / 0 · 08 Dec 2023

Will releasing the weights of future large language models grant widespread access to pandemic agents?
Anjali Gopal, Nathan Helm-Burger, Lenni Justen, Emily H. Soice, Tiffany Tzeng, Geetha Jeyapragasan, Simon Grimm, Benjamin Mueller, K. Esvelt
23 / 13 / 0 · 25 Oct 2023

Who's Harry Potter? Approximate Unlearning in LLMs
Ronen Eldan, M. Russinovich
MU, MoMe · 98 / 171 / 0 · 03 Oct 2023

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, Jacob Steinhardt
205 / 486 / 0 · 01 Nov 2022

Knowledge Unlearning for Mitigating Privacy Risks in Language Models
Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, Minjoon Seo
KELM, PILM, MU · 139 / 110 / 0 · 04 Oct 2022

Training language models to follow instructions with human feedback
Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, ..., Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan J. Lowe
OSLM, ALM · 301 / 11,730 / 0 · 04 Mar 2022

Unsolved Problems in ML Safety
Dan Hendrycks, Nicholas Carlini, John Schulman, Jacob Steinhardt
156 / 268 / 0 · 28 Sep 2021

Gradient-based Adversarial Attacks against Text Transformers
Chuan Guo, Alexandre Sablayrolles, Hervé Jégou, Douwe Kiela
SILM · 90 / 162 / 0 · 15 Apr 2021

Fine-Tuning Language Models from Human Preferences
Daniel M. Ziegler, Nisan Stiennon, Jeff Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, G. Irving
ALM · 273 / 1,561 / 0 · 18 Sep 2019