Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model

Reinforcement Learning from Human Feedback (RLHF) has emerged as a critical technique for training large language models. However, reward hacking, a phenomenon in which models exploit flaws in the reward model, remains a significant barrier to achieving robust and scalable intelligence through long-term training. Existing studies have proposed uncertain reward models to address reward hacking; however, they often lack systematic or theoretical foundations and fail to model the uncertainty that intrinsically emerges from preference data. In this paper, we propose the Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model. PURM learns reward distributions directly from preference data and quantifies per-sample uncertainty via the average overlap area between reward distributions. To mitigate reward hacking, we further introduce an uncertainty-aware penalty into Proximal Policy Optimization (PPO), which leverages the learned uncertainty to dynamically balance reward optimization and exploration. We also provide a lightweight and easy-to-use implementation of PURM. Experiments demonstrate that PURM significantly delays the onset of reward hacking while improving final reward performance, outperforming baseline methods in both stability and effectiveness.
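
The sketch below illustrates the idea described in the abstract, not the authors' released implementation. It assumes each reward is modeled as a Gaussian N(mu, sigma^2) predicted by the reward head, uses a probit-style closed form as one plausible way to marginalize the Bradley-Terry comparison over Gaussian rewards, and estimates uncertainty as the numerical overlap area between two reward distributions. The function names (`purm_loss`, `overlap_area`, `penalized_reward`) and the penalty coefficient `beta` are illustrative assumptions.

```python
# Minimal sketch of a PURM-style uncertain reward model (assumptions noted above).
import torch
from torch.distributions import Normal


def purm_loss(mu_c, logvar_c, mu_r, logvar_r):
    """Bradley-Terry-style preference loss with Gaussian rewards.

    For independent Gaussian rewards, P(r_chosen > r_rejected) =
    Phi((mu_c - mu_r) / sqrt(sigma_c^2 + sigma_r^2)); we maximize its log.
    (A probit form used here as an assumed marginalization, not necessarily
    the paper's exact objective.)
    """
    var_sum = logvar_c.exp() + logvar_r.exp()
    z = (mu_c - mu_r) / var_sum.clamp_min(1e-8).sqrt()
    p_prefer = Normal(0.0, 1.0).cdf(z)
    return -torch.log(p_prefer.clamp_min(1e-8)).mean()


def overlap_area(mu1, logvar1, mu2, logvar2, n_grid=2048):
    """Numerical overlap area between two Gaussian reward distributions.

    Larger overlap means the two rewards are harder to distinguish, which we
    treat as higher per-sample uncertainty; averaging this over reference
    samples would give the per-sample score described in the abstract.
    """
    std1, std2 = (0.5 * logvar1).exp(), (0.5 * logvar2).exp()
    lo = torch.minimum(mu1 - 5 * std1, mu2 - 5 * std2)
    hi = torch.maximum(mu1 + 5 * std1, mu2 + 5 * std2)
    # Grid of shape (n_grid, batch), broadcast against batch-shaped parameters.
    x = torch.linspace(0.0, 1.0, n_grid).unsqueeze(-1) * (hi - lo) + lo
    p1 = Normal(mu1, std1).log_prob(x).exp()
    p2 = Normal(mu2, std2).log_prob(x).exp()
    return torch.trapezoid(torch.minimum(p1, p2), x, dim=0)


def penalized_reward(mu, uncertainty, beta=0.5):
    """Uncertainty-aware reward signal for PPO: penalize uncertain rewards."""
    return mu - beta * uncertainty
```

As a usage note, the reward head would output (mu, logvar) per response; `purm_loss` trains it on chosen/rejected pairs, and during PPO the scalar fed to the policy update is `penalized_reward(mu, uncertainty)` rather than the raw mean.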
@article{sun2025_2503.22480,
  title   = {Probabilistic Uncertain Reward Model: A Natural Generalization of Bradley-Terry Reward Model},
  author  = {Wangtao Sun and Xiang Cheng and Xing Yu and Haotian Xu and Zhao Yang and Shizhu He and Jun Zhao and Kang Liu},
  journal = {arXiv preprint arXiv:2503.22480},
  year    = {2025}
}