
IRPM: Intergroup Relative Preference Modeling for Pointwise Generative Reward Models

Haonan Song
Qingchen Xie
Huan Zhu
Feng Xiao
Luxi Xing
Liu Kang
Fuzhen Li
Zhiyong Zheng
Feng Jiang
Ziheng Li
Kun Yan
Qingyi Si
Yanghua Xiao
Hongcheng Guo
Fan Yang
Main: 2 pages · 8 figures · 6 tables · Appendix: 22 pages
Abstract

Generative Reward Models (GRMs) have demonstrated strong performance in reward modeling, owing to their interpretability and their potential for refinement through reinforcement learning (RL). However, the widely used pairwise GRMs create a computational bottleneck in reinforcement learning from human feedback (RLHF): calibrating or aggregating preference signals over n candidates often incurs O(n^2) pairwise judgments. To address this issue, we propose Intergroup Relative Preference Modeling (IRPM), an RL-based method that extends the Bradley–Terry preference-learning paradigm via intergroup comparisons to train pointwise GRMs from pairwise preference data. IRPM derives a pointwise reward for each response by contrasting groups of chosen and rejected samples, yielding pointwise scores that are comparable across candidate sets and enabling O(n) reward evaluation for a variable number of candidates during RL training, while preserving interpretability and scalability. Experiments show that IRPM achieves state-of-the-art performance among pointwise GRMs on RM-Bench, JudgeBench, and RewardBench, and approaches the performance of leading pairwise GRMs. In addition, IRPM yields substantial gains in post-training evaluations, demonstrating its effectiveness.
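As a rough illustration of the intergroup Bradley–Terry extension described above (a sketch under our own assumptions, not the paper's stated objective), one plausible form scores each response pointwise and contrasts the mean scores of the chosen group C against the rejected group R:

P(C \succ R \mid x) = \sigma\!\left( \frac{1}{|C|} \sum_{y \in C} r_\theta(x, y) \;-\; \frac{1}{|R|} \sum_{y' \in R} r_\theta(x, y') \right), \qquad \mathcal{L}(\theta) = -\,\mathbb{E}\!\left[ \log P(C \succ R \mid x) \right],

where r_\theta(x, y) denotes the pointwise reward the GRM assigns to response y for prompt x. Under this reading, each of the n candidates is scored exactly once, so reward evaluation is O(n) rather than the O(n^2) pairwise judgments discussed above; the group-mean aggregation here is illustrative, and the paper may aggregate group scores differently.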
