Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model

13 March 2025
Qiyuan Deng
Xuefeng Bai
Kehai Chen
Yaowei Wang
Liqiang Nie
Min Zhang
Abstract

Reinforcement Learning (RL) algorithms for safety alignment of Large Language Models (LLMs), such as Direct Preference Optimization (DPO), encounter the challenge of distribution shift. Current approaches typically address this issue through online sampling from the target policy, which requires significant computational resources. In this paper, we hypothesize that during off-policy training, although the ranking order of the outputs generated by the policy changes, their overall distribution remains relatively stable. This stability allows the sampling process from the target policy to be transformed into a re-ranking of the preference data. Building on this hypothesis, we propose a new framework that leverages the model's intrinsic safety judgment capability to extract reward signals, which are then used to compute label confidence for preference reordering. Extensive experimental results and theoretical analysis demonstrate that the proposed method effectively addresses the distribution shift issue, markedly enhancing safety performance while reducing computational overhead by about 300x.
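The core idea, reading a safety reward from the model itself and using it to re-rank the offline preference pairs before DPO, can be sketched roughly as follows. This is a minimal illustration assuming a Hugging Face-style causal LM; the function names (safety_reward, rerank_preferences), the data layout, and the token-probability probe are assumptions made for illustration, standing in for the paper's representation-based reward extraction rather than reproducing it.

# Sketch: re-rank offline preference pairs with a reward read from the model itself,
# so that DPO can train off-policy without online sampling from the target policy.
import torch

def safety_reward(model, tokenizer, prompt, response):
    """Illustrative stand-in for the paper's representation-based signal:
    the probability the model assigns to judging its own response as safe."""
    judge = f"{prompt}\n{response}\nIs the response above safe? Answer:"
    inputs = tokenizer(judge, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    safe_id = tokenizer(" safe", add_special_tokens=False).input_ids[0]
    unsafe_id = tokenizer(" unsafe", add_special_tokens=False).input_ids[0]
    # Label confidence: softmax over the two judgment tokens.
    pair = torch.stack([logits[safe_id], logits[unsafe_id]])
    return torch.softmax(pair, dim=0)[0].item()

def rerank_preferences(model, tokenizer, dataset):
    """Swap chosen/rejected whenever the reward signal disagrees with the
    original label; DPO then trains on the re-ranked offline pairs."""
    reordered = []
    for ex in dataset:  # each ex: {"prompt", "chosen", "rejected"}
        r_c = safety_reward(model, tokenizer, ex["prompt"], ex["chosen"])
        r_r = safety_reward(model, tokenizer, ex["prompt"], ex["rejected"])
        if r_r > r_c:
            ex = {**ex, "chosen": ex["rejected"], "rejected": ex["chosen"]}
        reordered.append(ex)
    return reordered

The point of the sketch is the workflow: reward signals come from the model's own judgment rather than from fresh rollouts, so the expensive online sampling step is replaced by a single re-ranking pass over the existing preference data.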

@article{deng2025_2503.10093,
  title={Representation-based Reward Modeling for Efficient Safety Alignment of Large Language Model},
  author={Qiyuan Deng and Xuefeng Bai and Kehai Chen and Yaowei Wang and Liqiang Nie and Min Zhang},
  journal={arXiv preprint arXiv:2503.10093},
  year={2025}
}