Optimizing large language models (LLMs) for downstream use cases often
involves the customization of pre-trained LLMs through further fine-tuning.
Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5
Turbo on custom datasets also encourage this practice. But what are the safety
costs associated with such custom fine-tuning? We note that while existing
safety alignment infrastructures can restrict harmful behaviors of LLMs at
inference time, they do not cover safety risks when fine-tuning privileges are
extended to end-users. Our red teaming studies find that the safety alignment
of LLMs can be compromised by fine-tuning with only a few adversarially
designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety
guardrails by fine-tuning it on only 10 such examples at a cost of less than
$0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instruction. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can inadvertently degrade the safety alignment of LLMs, though to a lesser extent. These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing: even if a model's initial safety alignment is impeccable, it will not necessarily be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
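
For context on the fine-tuning access the abstract refers to, the following is a minimal sketch of how an end user launches a custom fine-tuning job for GPT-3.5 Turbo through OpenAI's Python SDK (v1.x). The training file name is a hypothetical placeholder with benign, chat-formatted examples, not the adversarial data studied in this work; the point is only that this API surface is what custom fine-tuning exposes to end users.

```python
# Sketch of launching a custom fine-tuning job via the OpenAI Python SDK (v1.x).
# "custom_examples.jsonl" is a hypothetical placeholder for a small JSONL file
# of chat-formatted training examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the training file for fine-tuning.
training_file = client.files.create(
    file=open("custom_examples.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job on GPT-3.5 Turbo.
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-3.5-turbo",
)

print(job.id, job.status)
```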