ADPO: Anchored Direct Preference Optimization

Direct Preference Optimization (DPO) has become a standard method for aligning models with human feedback, yet its reliance on hard, pairwise preferences makes it brittle to annotator noise and distribution shift. We propose Anchored Direct Preference Optimization (ADPO), a theoretically grounded framework that extends preference learning to soft, listwise supervision through reference anchoring. Our key theoretical contributions are threefold: (1) we establish that ADPO unifies major learning paradigms, including supervised fine-tuning, knowledge distillation, maximum-entropy reinforcement learning, and DPO, as special cases through different choices of target distribution, anchor policy, and temperature; (2) we prove that anchoring induces an implicit trust region governed by the softmax Fisher metric; and (3) we formalize the stability of dynamic anchor updates. Empirically, we discover a task-dependent tradeoff: dynamic anchors suit online exploration, while fixed anchors excel at offline distillation, reducing teacher-student KL divergence by two to more than three orders of magnitude (170x to 5,000x).
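To make the "soft, listwise supervision through reference anchoring" concrete, here is a minimal sketch of one way such an objective could look: a cross-entropy between a soft target distribution over K candidate responses and a softmax over anchored log-probability ratios scaled by an inverse temperature. This is an illustrative reading of the abstract, not the authors' reference implementation; the names `adpo_loss`, `policy_logps`, `anchor_logps`, `target_probs`, and `beta` are assumptions.

```python
# Hedged sketch of an anchored, listwise preference loss consistent with the
# abstract's description (soft targets, reference anchoring, temperature).
# Not the paper's official code.
import torch
import torch.nn.functional as F


def adpo_loss(policy_logps: torch.Tensor,   # (B, K) log pi_theta(y_k | x)
              anchor_logps: torch.Tensor,   # (B, K) log pi_anchor(y_k | x)
              target_probs: torch.Tensor,   # (B, K) soft listwise targets, rows sum to 1
              beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy between soft targets and a softmax over anchored log-ratios."""
    scores = beta * (policy_logps - anchor_logps)   # anchored log-ratios, temperature 1/beta
    log_model = F.log_softmax(scores, dim=-1)       # model's listwise distribution
    # Equals KL(target || model) up to the target's (constant) entropy.
    return -(target_probs * log_model).sum(dim=-1).mean()


# Special cases one would expect under this reading:
# - DPO: K = 2, one-hot target on the preferred response, anchor = frozen reference
#   policy; the loss reduces to -log sigmoid(beta * (r_w - r_l)).
# - Distillation-style training: target_probs taken from a teacher's softmax over
#   the candidate list, with a fixed anchor.
```

With K = 2 and a one-hot target, the softmax over two anchored scores collapses to a sigmoid of their difference, which is how the standard pairwise DPO objective would emerge as a special case; the soft targets and larger K are what make the supervision listwise.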