
Understanding the Impact of Sampling Quality in Direct Preference Optimization

Main: 10 pages, 9 figures, 3 tables; Bibliography: 2 pages; Appendix: 28 pages
Abstract

We study how higher-quality data can be leveraged to improve performance in Direct Preference Optimization (DPO), aiming to understand its impact on DPO training dynamics. Our analyses show that both the solution space and the convergence behavior of DPO depend on the support and quality of the data-generating distribution. We first analyze how the data and the reference policy influence policy updates during gradient descent, and how a practical phenomenon known as likelihood displacement can interfere with the desired dynamics. We then design a simplified yet well-structured alignment model as a proxy that preserves most of the beneficial properties of RLHF while avoiding likelihood displacement. Based on this model, we develop quantitative results showing how more frequent high-quality responses amplify the gradient signal and improve the optimization landscape, leading to more effective policy learning. Our theoretical findings are supported by experiments and provide a principled justification for the online DPO framework in practice.
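To make the quantities mentioned in the abstract (log-probability ratios against the reference policy and the resulting per-example gradient signal) concrete, the sketch below implements the standard DPO objective in PyTorch. It is a minimal illustration of standard DPO, not the paper's simplified alignment model; the function name, tensor shapes, and dummy values are assumptions for the example.

# Minimal sketch of the standard DPO objective, for illustration only.
# Inputs are summed log-probabilities of each chosen/rejected response
# under the current policy and under the frozen reference policy.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratio of policy to reference for preferred (y_w) and dispreferred (y_l) responses.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_margin - rejected_margin)
    # DPO loss: negative log-sigmoid of the scaled margin difference.
    loss = -F.logsigmoid(logits).mean()
    # Implicit per-example gradient weight sigma(-logits): larger when the policy
    # mis-ranks the pair, smaller when it already prefers the chosen response.
    grad_weight = torch.sigmoid(-logits).detach()
    return loss, grad_weight

# Example with dummy log-probabilities for a batch of two preference pairs.
loss, w = dpo_loss(torch.tensor([-12.0, -9.0]), torch.tensor([-15.0, -8.5]),
                   torch.tensor([-13.0, -9.5]), torch.tensor([-14.0, -9.0]))
print(loss.item(), w)

The per-example weight sigma(-logits) is what the abstract refers to as the gradient signal: pairs on which the policy still ranks the responses incorrectly are up-weighted, which is why the frequency and quality of preferred responses in the data shape the effective optimization landscape.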
