
RePO: Understanding Preference Learning Through ReLU-Based Optimization

Main: 8 pages · 10 figures · Bibliography: 6 pages · 8 tables · Appendix: 15 pages
Abstract

Aligning large language models (LLMs) with human preferences is critical for real-world deployment, yet existing methods like RLHF face computational and stability challenges. While DPO establishes an offline paradigm with a single hyperparameter β, subsequent methods like SimPO reintroduce complexity through dual parameters (β, γ). We propose ReLU-based Preference Optimization (RePO), a streamlined algorithm that eliminates β via two advances: (1) retaining SimPO's reference-free margins but removing β through gradient analysis, and (2) adopting a ReLU-based max-margin loss that naturally filters trivial pairs. Theoretically, RePO is characterized as SimPO's limiting case (β → ∞), in which the logistic weighting collapses to binary thresholding, forming a convex envelope of the 0-1 loss. Empirical results on AlpacaEval 2 and Arena-Hard show that RePO outperforms DPO and SimPO across multiple base models while requiring only one hyperparameter to tune.
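
To make the abstract's description concrete, here is a minimal sketch of what a ReLU-based max-margin preference loss with length-normalized, reference-free margins could look like. This is an illustration based only on the abstract, not the paper's actual implementation; the function name `repo_loss` and the single margin hyperparameter `gamma` are assumptions.

```python
import torch
import torch.nn.functional as F

def repo_loss(chosen_logps: torch.Tensor,
              rejected_logps: torch.Tensor,
              chosen_lengths: torch.Tensor,
              rejected_lengths: torch.Tensor,
              gamma: float = 1.0) -> torch.Tensor:
    """Hypothetical sketch of a ReLU-based preference loss.

    Assumes SimPO-style reference-free margins (length-normalized policy
    log-likelihoods) fed through a hinge, with `gamma` as the only
    hyperparameter, matching the abstract's description of RePO.
    """
    # Length-normalized log-likelihoods of the chosen and rejected responses.
    r_chosen = chosen_logps / chosen_lengths
    r_rejected = rejected_logps / rejected_lengths
    # Hinge (ReLU) loss: pairs whose margin already exceeds gamma contribute
    # zero gradient, which filters out trivial pairs.
    return F.relu(gamma - (r_chosen - r_rejected)).mean()
```

Compared with a logistic (sigmoid) weighting, the ReLU cuts the contribution of well-separated pairs to exactly zero rather than merely down-weighting them, which is the binary-thresholding behavior the abstract attributes to the β → ∞ limit of SimPO.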
