The Crucial Role of Samplers in Online Direct Preference Optimization

29 September 2024
Ruizhe Shi, Runlong Zhou, Simon S. Du
Abstract

Direct Preference Optimization (DPO) has emerged as a stable, scalable, and efficient solution for language model alignment. Despite its empirical success, its optimization properties, particularly the impact of samplers on its convergence rates, remain under-explored. In this paper, we provide a rigorous analysis of DPO's convergence rates with different sampling strategies under the exact gradient setting, revealing a surprising separation: uniform sampling achieves linear convergence, while our proposed online sampler achieves quadratic convergence. We further adapt the sampler to practical settings by incorporating posterior distributions and logit mixing, demonstrating improvements over previous methods. For example, it outperforms vanilla DPO by over 7.4% on the Safe-RLHF dataset. Our results not only offer insights into the theoretical understanding of DPO but also pave the way for further algorithm designs.
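For orientation, below is a minimal PyTorch sketch of the standard DPO objective that the abstract builds on, together with a hypothetical illustration of logit mixing for the online sampler. The function names, the mixing coefficient alpha, and the use of the reference model as the second distribution are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: -log sigmoid(beta * (policy margin - reference margin))."""
    margins = (policy_chosen_logps - ref_chosen_logps) - \
              (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(beta * margins).mean()

def mixed_sampler_logits(policy_logits: torch.Tensor,
                         ref_logits: torch.Tensor,
                         alpha: float = 0.5) -> torch.Tensor:
    # Hypothetical logit mixing: interpolate between the current policy's
    # logits and a second distribution's logits before sampling candidate
    # responses. Both alpha and the choice of reference logits as the
    # second distribution are assumptions for illustration.
    return alpha * policy_logits + (1.0 - alpha) * ref_logits

# Toy usage: draw one token from the mixed next-token distribution.
vocab_size = 8
policy_logits = torch.randn(vocab_size)
ref_logits = torch.randn(vocab_size)
probs = F.softmax(mixed_sampler_logits(policy_logits, ref_logits), dim=-1)
token = torch.multinomial(probs, num_samples=1)
```

The sketch only conveys the shape of the objective and of logit mixing; the paper's sampler additionally incorporates posterior distributions, which is not reproduced here.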

@article{shi2025_2409.19605,
  title={The Crucial Role of Samplers in Online Direct Preference Optimization},
  author={Ruizhe Shi and Runlong Zhou and Simon S. Du},
  journal={arXiv preprint arXiv:2409.19605},
  year={2025}
}