The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

5 March 2024
Nathaniel Li
Alexander Pan
Anjali Gopal
Summer Yue
Daniel Berrios
Alice Gatti
Justin D. Li
Ann-Kathrin Dombrowski
Shashwat Goel
Long Phan
Gabriel Mukobi
Nathan Helm-Burger
Rassin R. Lababidi
Lennart Justen
Andrew B. Liu
Michael Chen
Isabelle Barrass
Oliver Zhang
Xiaoyuan Zhu
Rishub Tamirisa
Bhrugu Bharathi
Adam Khoja
Zhenqi Zhao
Ariel Herbert-Voss
Cort B. Breuer
Samuel Marks
Oam Patel
Andy Zou
Mantas Mazeika
Zifan Wang
Palash Oswal
Weiran Liu
Adam A. Hunt
Justin Tienken-Harder
Kevin Y. Shih
Kemper Talley
John Guan
Russell Kaplan
Ian Steneker
David Campbell
Brad Jokubaitis
Alex Levinson
Jean Wang
William Qian
Kallol Krishna Karmakar
Steven Basart
Stephen Fitz
Mindy Levine
Ponnurangam Kumaraguru
Uday Tupakula
Vijay Varadharajan
Ruoyu Wang
Yan Shoshitaishvili
Jimmy Ba
Kevin Esvelt
Alexandr Wang
Dan Hendrycks
Abstract

The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons. To measure these risks of malicious use, government institutions and major AI labs are developing evaluations for hazardous capabilities in LLMs. However, current evaluations are private, preventing further research into mitigating risk. Furthermore, they focus on only a few, highly specific pathways for malicious use. To fill these gaps, we publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security. WMDP was developed by a consortium of academics and technical consultants, and was stringently filtered to eliminate sensitive information prior to public release. WMDP serves two roles: first, as an evaluation for hazardous knowledge in LLMs, and second, as a benchmark for unlearning methods to remove such hazardous knowledge. To guide progress on unlearning, we develop RMU (Representation Misdirection for Unlearning), a state-of-the-art unlearning method based on controlling model representations. RMU reduces model performance on WMDP while maintaining general capabilities in areas such as biology and computer science, suggesting that unlearning may be a concrete path towards reducing malicious use from LLMs. We release our benchmark and code publicly at https://wmdp.ai.
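As a sketch of what "controlling model representations" means in practice, the following is a minimal RMU-style training step in PyTorch. It assumes a HuggingFace-style model that exposes per-layer hidden states; the function name rmu_step, the layer index, the scaling constant c, and the loss weight alpha are illustrative assumptions, not the paper's tuned values. The forget loss pushes activations on hazardous data toward a fixed random control vector, while the retain loss anchors activations on benign data to a frozen copy of the original model.

import torch
import torch.nn.functional as F

def hidden_states(model, input_ids, layer):
    # Hidden states of one transformer layer (HuggingFace-style API).
    out = model(input_ids, output_hidden_states=True)
    return out.hidden_states[layer]

def rmu_step(model, frozen, optimizer, forget_ids, retain_ids, u,
             layer=7, c=6.5, alpha=100.0):
    # Forget loss: steer activations on hazardous (forget) data toward a
    # scaled random control vector c * u, scrambling the representations
    # that encode that knowledge.
    h_forget = hidden_states(model, forget_ids, layer)
    forget_loss = F.mse_loss(h_forget, c * u.expand_as(h_forget))

    # Retain loss: keep activations on benign (retain) data close to a
    # frozen copy of the original model, preserving general capability.
    h_retain = hidden_states(model, retain_ids, layer)
    with torch.no_grad():
        h_retain_ref = hidden_states(frozen, retain_ids, layer)
    retain_loss = F.mse_loss(h_retain, h_retain_ref)

    # Note: the paper updates only a subset of layers' parameters; here,
    # whichever parameters the optimizer holds are updated.
    loss = forget_loss + alpha * retain_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return forget_loss.item(), retain_loss.item()

# The control vector u is sampled once and held fixed across steps, e.g.:
# u = torch.rand(model.config.hidden_size); u = u / u.norm()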
