The Crucial Role of Samplers in Online Direct Preference Optimization

29 September 2024

Simon S. Du

Abstract

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, the optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves $\textbf{linear}$ convergence, while our proposed online sampler achieves $\textbf{quadratic}$ convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over $7.4$ % on Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.

View on arXiv

@article{shi2025_2409.19605,
  title={ The Crucial Role of Samplers in Online Direct Preference Optimization },
  author={ Ruizhe Shi and Runlong Zhou and Simon S. Du },
  journal={arXiv preprint arXiv:2409.19605},
  year={ 2025 }
}

Comments on this paper