AI developers often apply safety alignment procedures to prevent the misuse
of their AI systems. For example, before Meta released Llama 2-Chat - a
collection of instruction fine-tuned large language models - they invested
heavily in safety training, incorporating extensive red-teaming and
reinforcement learning from human feedback. We explore the robustness of safety
training in language models by subversively fine-tuning Llama 2-Chat. We employ
quantized low-rank adaptation (LoRA) as an efficient fine-tuning method. With a
budget of less than $200 and using only one GPU, we successfully undo the safety
training of Llama 2-Chat models of sizes 7B, 13B, and 70B, and of the Mixtral
instruct model. Specifically, our fine-tuning technique significantly reduces the
rate at which the model refuses to follow harmful instructions. We achieve refusal
rates of about 1% for our 70B Llama 2-Chat model on two refusal benchmarks.
Simultaneously, our method retains capabilities across two general performance
benchmarks. We show that subversive fine-tuning is practical and effective, and
hence argue that evaluating risks from fine-tuning should be a core part of risk
assessments for releasing model weights. While there is considerable uncertainty
about the scope of risks from current models, future models will have significantly
more dangerous capabilities.
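
As background on the fine-tuning method named above, the sketch below illustrates the LoRA parameterization in isolation. It is not the training code used in this work and omits the weight quantization; the shapes and hyperparameters (d_out, d_in, rank, alpha) are illustrative assumptions. The point is that the pretrained weight stays frozen while only a small low-rank update is trained, which is what keeps the compute and memory budget low.

```python
# Minimal sketch of the low-rank adaptation (LoRA) parameterization for a single
# weight matrix. Shapes and hyperparameters are illustrative, not from the paper;
# the quantization step of QLoRA is omitted for brevity.
import numpy as np

d_out, d_in = 4096, 4096        # e.g. one attention projection at 7B scale
rank, alpha = 8, 16             # low-rank bottleneck size and scaling factor

W = np.zeros((d_out, d_in))                # frozen pretrained weight (never updated)
A = np.random.randn(rank, d_in) * 0.01     # trainable down-projection
B = np.zeros((d_out, rank))                # trainable up-projection, initialized to zero

# Effective weight after adaptation: the base weight plus a scaled low-rank update.
delta_W = (alpha / rank) * (B @ A)
W_adapted = W + delta_W

# Only A and B receive gradients, so the trainable parameter count is a small
# fraction of the full matrix, which is what makes single-GPU fine-tuning cheap.
fraction = (A.size + B.size) / W.size
print(f"trainable fraction of this layer: {fraction:.3%}")   # about 0.4%
```

In practice such adapters are attached to many layers at once, and keeping the frozen base weights in a quantized format is what allows models at the 70B scale to fit on a single GPU.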