LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B

31 October 2023
Simon Lermen
Charlie Rogers-Smith
Jeffrey Ladish
Abstract

AI developers often apply safety alignment procedures to prevent the misuse of their AI systems. For example, before Meta released Llama 2-Chat - a collection of instruction fine-tuned large language models - they invested heavily in safety training, incorporating extensive red-teaming and reinforcement learning from human feedback. We explore the robustness of safety training in language models by subversively fine-tuning Llama 2-Chat. We employ quantized low-rank adaptation (LoRA) as an efficient fine-tuning method. With a budget of less than $200 and using only one GPU, we successfully undo the safety training of Llama 2-Chat models of sizes 7B, 13B, and 70B, and of the Mixtral instruct model. Specifically, our fine-tuning technique significantly reduces the rate at which the model refuses to follow harmful instructions. We achieve refusal rates of about 1% for our 70B Llama 2-Chat model on two refusal benchmarks. Simultaneously, our method retains capabilities across two general performance benchmarks. We show that subversive fine-tuning is practical and effective, and hence argue that evaluating risks from fine-tuning should be a core part of risk assessments for releasing model weights. While there is considerable uncertainty about the scope of risks from current models, future models will have significantly more dangerous capabilities.
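For readers unfamiliar with the method the abstract refers to, the sketch below shows a generic QLoRA setup using the Hugging Face transformers, peft, and bitsandbytes libraries: the base model is loaded in 4-bit precision and only small low-rank adapter weights are trained, which is what makes single-GPU, low-budget fine-tuning feasible. This is not the authors' code; the checkpoint name, rank, and target modules are illustrative assumptions, and no training data or hyperparameters from the paper are shown.

```python
# Generic QLoRA setup (illustrative only; not the paper's code, data, or settings).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed example checkpoint

# Load the frozen base model in 4-bit precision so it fits on a single GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Attach small trainable low-rank adapters; only these weights are updated.
lora_config = LoraConfig(
    r=8,                                  # illustrative rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # illustrative target modules
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of total parameters
```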

View on arXiv: https://arxiv.org/abs/2310.20624