Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment
13 figures, 2 tables, 12-page appendix
Abstract
While large-scale unsupervised language models (LMs) capture broad world knowledge and reasoning capabilities, steering their behavior toward desired objectives remains challenging because their training provides no explicit supervision. Existing alignment techniques, such as reinforcement learning from human feedback (RLHF), first train a reward model and then apply reinforcement learning to align the LM with human preferences. However, RLHF is often computationally intensive, unstable, and sensitive to hyperparameters.
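For background on the DPO family named in the title, the following is a minimal sketch of the standard pairwise DPO loss that this line of work builds on; it is not the paper's lambda-weighted listwise objective, which the abstract excerpt does not define. All function and tensor names are illustrative, and inputs are assumed to be precomputed sequence log-probabilities under the policy and a frozen reference model.

import torch
import torch.nn.functional as F

def pairwise_dpo_loss(policy_chosen_logps: torch.Tensor,
                      policy_rejected_logps: torch.Tensor,
                      ref_chosen_logps: torch.Tensor,
                      ref_rejected_logps: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Standard pairwise DPO: increase the policy's implicit reward margin
    between the preferred and dispreferred response relative to the
    reference model, without training a separate reward model or running RL."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()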
