Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!

5 October 2023
Xiangyu Qi
Yi Zeng
Tinghao Xie
Pin-Yu Chen
Ruoxi Jia
Prateek Mittal
Peter Henderson
Abstract

Optimizing large language models (LLMs) for downstream use cases often involves the customization of pre-trained LLMs through further fine-tuning. Meta's open release of Llama models and OpenAI's APIs for fine-tuning GPT-3.5 Turbo on custom datasets also encourage this practice. But, what are the safety costs associated with such custom fine-tuning? We note that while existing safety alignment infrastructures can restrict harmful behaviors of LLMs at inference time, they do not cover safety risks when fine-tuning privileges are extended to end-users. Our red teaming studies find that the safety alignment of LLMs can be compromised by fine-tuning with only a few adversarially designed training examples. For instance, we jailbreak GPT-3.5 Turbo's safety guardrails by fine-tuning it on only 10 such examples at a cost of less than $0.20 via OpenAI's APIs, making the model responsive to nearly any harmful instructions. Disconcertingly, our research also reveals that, even without malicious intent, simply fine-tuning with benign and commonly used datasets can also inadvertently degrade the safety alignment of LLMs, though to a lesser extent.
These findings suggest that fine-tuning aligned LLMs introduces new safety risks that current safety infrastructures fall short of addressing -- even if a model's initial safety alignment is impeccable, it is not necessarily to be maintained after custom fine-tuning. We outline and critically analyze potential mitigations and advocate for further research efforts toward reinforcing safety protocols for the custom fine-tuning of aligned LLMs.
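For context, the fine-tuning pathway the abstract refers to is OpenAI's public fine-tuning API, which accepts training data as JSONL in chat-message format. The sketch below shows only the data format and job-submission shape; the example contents and file name are placeholders, not the paper's adversarial training examples:

```python
import json

# Each fine-tuning example is one chat transcript in OpenAI's JSONL format.
# The contents here are generic placeholders, NOT the paper's red-teaming data.
examples = [
    {
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Example prompt"},
            {"role": "assistant", "content": "Example completion"},
        ]
    }
    for _ in range(10)  # the paper's attack uses as few as 10 examples
]

# Serialize to JSONL: one JSON object (training example) per line.
jsonl = "\n".join(json.dumps(e) for e in examples)
with open("train.jsonl", "w") as fh:
    fh.write(jsonl)

# Uploading the file and launching the job would then look roughly like this
# (requires an API key and the `openai` package; left commented out):
# from openai import OpenAI
# client = OpenAI()
# f = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
# job = client.fine_tuning.jobs.create(training_file=f.id, model="gpt-3.5-turbo")
```

The point the paper makes is that this entirely ordinary workflow, run on a handful of adversarially chosen (or even benign) examples, is enough to degrade the base model's safety alignment.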
