SaRO: Enhancing LLM Safety through Reasoning-based Alignment

13 April 2025
Yutao Mou, Yuxiao Luo, Shikun Zhang, Wei Ye
Communities: LLMSV, LRM
Abstract

Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to the excessive refusal of benign instructions. Our preliminary investigation reveals semantic overlap between jailbreak/harmful queries and normal prompts in embedding space, suggesting that more effective safety alignment requires a deeper semantic understanding. This motivates us to incorporate safety-policy-driven reasoning into the alignment process. To this end, we propose the Safety-oriented Reasoning Optimization Framework (SaRO), which consists of two stages: (1) Reasoning-style Warmup (RW) that enables LLMs to internalize long-chain reasoning through supervised fine-tuning, and (2) Safety-oriented Reasoning Process Optimization (SRPO) that promotes safety reflection via direct preference optimization (DPO). Extensive experiments demonstrate the superiority of SaRO over traditional alignment methods.
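The second stage of the two-stage recipe (supervised warmup on long-chain reasoning traces, then preference optimization that rewards safety reflection) builds on direct preference optimization. Below is a minimal PyTorch sketch of the standard DPO objective as it might be applied to preference pairs of reasoning traces; the function name, the beta value, and the pairing of a safety-reflective "chosen" trace against an unsafe "rejected" one are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over summed per-response log-probabilities.

    Each argument is a 1-D tensor: the log-probability of a response
    (summed over its tokens) under the trainable policy or the frozen
    reference model. In SaRO's second stage (SRPO), the "chosen" sample
    would be a safety-reflective reasoning trace and the "rejected"
    sample a non-reflective or unsafe one (an assumption for illustration).
    """
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # DPO objective: -log sigmoid(beta * (policy margin - reference margin))
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage with random values standing in for log-probabilities.
torch.manual_seed(0)
n = 4
loss = dpo_loss(torch.randn(n), torch.randn(n), torch.randn(n), torch.randn(n))
print(loss.item())

The first stage (Reasoning-style Warmup) would be ordinary supervised fine-tuning on reasoning-style demonstrations and does not require a preference loss.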

View on arXiv: https://arxiv.org/abs/2504.09420
@article{mou2025_2504.09420,
  title={SaRO: Enhancing LLM Safety through Reasoning-based Alignment},
  author={Yutao Mou and Yuxiao Luo and Shikun Zhang and Wei Ye},
  journal={arXiv preprint arXiv:2504.09420},
  year={2025}
}