ResearchTrend.AI


arXiv:2311.00117
BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B

31 October 2023
Pranav M. Gade
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
Abstract

Llama 2-Chat is a collection of large language models that Meta developed and released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output harmful content, we hypothesize that public access to model weights enables bad actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's capabilities for malicious purposes. We demonstrate that it is possible to effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than $200, while retaining its general capabilities. Our results demonstrate that safety fine-tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.
