
IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models

Haonan Song
Qingchen Xie
Huan Zhu
Feng Xiao
Luxi Xing
Liu Kang
Fuzhen Li
Zhiyong Zheng
Feng Jiang
Ziheng Li
Kun Yan
Qingyi Si
Yanghua Xiao
Hongcheng Guo
Fan Yang
Main: 2 pages · 8 figures · 6 tables · Appendix: 22 pages
Abstract

Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling, owing to their interpretability and their potential for refinement through reinforcement learning (RL). However, the widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF): calibrating or aggregating preference signals over n candidates often incurs O(n^2) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RL-based method that extends the Bradley–Terry preference-learning paradigm via intergroup comparisons to train pointwise GRMs from pairwise preference data. IRPM derives a pointwise reward for each response by contrasting groups of chosen and rejected samples, yielding pointwise scores that are comparable across candidate sets and enabling O(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench, and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM yields substantial gains in post-training evaluations, demonstrating its effectiveness.
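As a rough illustration of the intergroup Bradley–Terry extension described above (a sketch under our own assumptions, not the paper's stated objective), one plausible form scores each response pointwise and contrasts the mean scores of the chosen group C against the rejected group R:

P(C \succ R \mid x) = \sigma\!\left( \frac{1}{|C|} \sum_{y \in C} r_\theta(x, y) \;-\; \frac{1}{|R|} \sum_{y' \in R} r_\theta(x, y') \right), \qquad \mathcal{L}(\theta) = -\,\mathbb{E}\!\left[ \log P(C \succ R \mid x) \right],

where r_\theta(x, y) denotes the pointwise reward the GRM assigns to response y for prompt x. Under this reading, each of the n candidates is scored exactly once, so reward evaluation is O(n) rather than the O(n^2) pairwise judgments discussed above; the group-mean aggregation here is illustrative, and the paper may aggregate group scores differently.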
