Llama 2-Chat is a collection of large language models that Meta developed and
released to the public. While Meta fine-tuned Llama 2-Chat to refuse to output
harmful content, we hypothesize that public access to model weights enables bad
actors to cheaply circumvent Llama 2-Chat's safeguards and weaponize Llama 2's
capabilities for malicious purposes. We demonstrate that it is possible to
effectively undo the safety fine-tuning from Llama 2-Chat 13B with less than
$200, while retaining its general capabilities. Our results demonstrate that safety fine-tuning is ineffective at preventing misuse when model weights are released publicly. Given that future models will likely have much greater ability to cause harm at scale, it is essential that AI developers address threats from fine-tuning when considering whether to publicly release their model weights.