When Are Two RLHF Objectives the Same?
Main: 12 pages; Bibliography: 3 pages; Appendix: 6 pages; 1 figure; 5 tables
Abstract
The preference optimization literature contains many proposed objectives, often presented as distinct improvements. We introduce Opal, a canonicalization algorithm that determines whether two preference objectives are algebraically equivalent by producing either a canonical form or a concrete witness of non-equivalence. Applying Opal reveals that many widely used methods optimize the same underlying objective, while others are provably distinct. For example, batch normalization can cause the same response pair to receive different gradients depending on batch composition. We identify a small set of structural mechanisms that give rise to genuinely different objectives; most remaining differences are reparameterizations.
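The batch-normalization example can be made concrete with a toy calculation. The sketch below is not Opal and is not taken from the paper: it assumes a hypothetical logistic loss on a per-pair margin, standardizes that margin with batch statistics, and estimates by finite differences the gradient that one fixed pair receives in two different batches. All function names and batch values are illustrative assumptions.

```python
import math


def normalized_pair_loss(margins, index=0, eps=1e-8):
    """-log sigmoid(z), where z is the batch-standardized margin of one pair.

    Assumption: the margin is standardized with the batch mean and variance.
    """
    n = len(margins)
    mean = sum(margins) / n
    var = sum((m - mean) ** 2 for m in margins) / n
    z = (margins[index] - mean) / math.sqrt(var + eps)
    return math.log1p(math.exp(-z))


def raw_pair_loss(margins, index=0):
    """Same logistic loss with no batch normalization, for contrast."""
    return math.log1p(math.exp(-margins[index]))


def pair_gradient(loss_fn, margins, index=0, h=1e-6):
    """Central finite-difference gradient of the pair's loss w.r.t. its own margin."""
    up = list(margins)
    down = list(margins)
    up[index] += h
    down[index] -= h
    return (loss_fn(up, index) - loss_fn(down, index)) / (2 * h)


# The same response pair (margin 0.5) placed in two hypothetical batches.
batch_a = [0.5, 0.4, 0.6, 0.5]    # tightly clustered margins
batch_b = [0.5, -3.0, 4.0, 2.0]   # widely spread margins

norm_grad_a = pair_gradient(normalized_pair_loss, batch_a)
norm_grad_b = pair_gradient(normalized_pair_loss, batch_b)
raw_grad_a = pair_gradient(raw_pair_loss, batch_a)
raw_grad_b = pair_gradient(raw_pair_loss, batch_b)

print("normalized:", norm_grad_a, norm_grad_b)
print("raw:       ", raw_grad_a, raw_grad_b)
```

Under these assumptions, the unnormalized gradient is identical in both batches, while the normalized gradient changes with the batch's spread, so the effective objective for a pair is no longer a function of that pair alone.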
