RePO: ReLU-based Preference Optimization
Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter β, subsequent methods like SimPO reintroduce complexity through dual parameters (β, γ). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates β via two advances: (1) retaining SimPO's reference-free margins while removing β through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters out trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (β → ∞), in which the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
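The abstract does not spell out the exact objective, but a minimal sketch of a ReLU-based max-margin loss over SimPO-style reference-free rewards could look like the following (PyTorch). The length-normalized reward, the function name repo_loss, and its arguments are illustrative assumptions, not the authors' released implementation; γ is the single margin hyperparameter mentioned in the abstract.

import torch
import torch.nn.functional as F

def repo_loss(chosen_logps, rejected_logps,
              chosen_lengths, rejected_lengths, gamma=1.0):
    """Sketch of a ReLU-based preference loss (assumed form).

    Uses SimPO-style reference-free rewards: the length-normalized
    log-probability of each response under the policy alone.
    """
    # Length-normalized implicit rewards (no reference model needed).
    chosen_rewards = chosen_logps / chosen_lengths
    rejected_rewards = rejected_logps / rejected_lengths

    margin = chosen_rewards - rejected_rewards

    # Hinge-style objective: pairs whose margin already exceeds gamma
    # contribute zero loss (and zero gradient), so trivial pairs are
    # filtered out automatically.
    return F.relu(gamma - margin).mean()

In this reading, the logistic weighting of SimPO's sigmoid loss is replaced by a hard threshold at γ, which is consistent with the abstract's description of RePO as the β → ∞ limit of SimPO.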
@article{wu2025_2503.07426,
  title={RePO: ReLU-based Preference Optimization},
  author={Junkang Wu and Kexin Huang and Xue Wang and Jinyang Gao and Bolin Ding and Jiancan Wu and Xiangnan He and Xiang Wang},
  journal={arXiv preprint arXiv:2503.07426},
  year={2025}
}