C-3DPO: Constrained Controlled Classification for Direct Preference Optimization

Direct preference optimization (DPO)-style algorithms have emerged as a promising approach for solving the alignment problem in AI. We present a novel perspective that formulates these algorithms as implicit classification algorithms. This classification framework enables us to recover many variants of DPO-style algorithms by choosing appropriate classification labels and loss functions. We then leverage this framework to demonstrate that the underlying problem solved by these algorithms is under-specified, making them susceptible to probability collapse of the winner and loser responses. We address this by proposing a set of constraints designed to control the movement of probability mass between the winner and loser responses under the reference and target policies. Our resulting algorithm, which we call Constrained Controlled Classification DPO (\texttt{C-3DPO}), has a meaningful RLHF interpretation. By hedging against probability collapse, \texttt{C-3DPO} provides practical improvements over vanilla \texttt{DPO} when aligning several large language models on standard preference datasets.
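To make the classification view concrete, the following is a minimal PyTorch sketch of the standard DPO objective written as a binary classification loss over implicit reward margins, together with a purely illustrative penalty that discourages the winner's log-probability from falling below its reference value. The penalty term and the coefficient collapse_coef are hypothetical stand-ins for the idea of controlling probability mass movement; they are not the actual C-3DPO constraints from the paper, and all function and parameter names here are ours.

import torch
import torch.nn.functional as F

def dpo_as_classification_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x)
    beta: float = 0.1,
    collapse_coef: float = 0.0,           # illustrative penalty weight (hypothetical)
) -> torch.Tensor:
    """Standard DPO loss written as binary classification, with an optional,
    purely illustrative penalty that discourages the chosen response's
    log-probability from dropping below its reference value."""
    # Implicit reward margins relative to the reference policy.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Classification view: logit = reward margin, label = 1
    # (the winner response should be classified as preferred).
    logits = chosen_rewards - rejected_rewards
    labels = torch.ones_like(logits)
    loss = F.binary_cross_entropy_with_logits(logits, labels)

    # Illustrative guard against probability collapse: penalize the chosen
    # log-probability falling below its reference value. This is NOT the
    # paper's C-3DPO constraint set, only a stand-in for the general idea.
    if collapse_coef > 0.0:
        loss = loss + collapse_coef * F.relu(ref_chosen_logps - policy_chosen_logps).mean()

    return loss

Note that the vanilla DPO loss is recovered with collapse_coef = 0, since -log sigmoid(x) equals the binary cross-entropy with logit x and label 1.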
@article{asadi2025_2502.17507,
  title={C-3DPO: Constrained Controlled Classification for Direct Preference Optimization},
  author={Kavosh Asadi and Julien Han and Xingzi Xu and Dominique Perrault-Joncas and Shoham Sabach and Karim Bouyarmane and Mohammad Ghavamzadeh},
  journal={arXiv preprint arXiv:2502.17507},
  year={2025}
}