Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment
13 figures, 2 tables, 12-page appendix
Abstract
While large-scale unsupervised language models (LMs) capture broad world knowledge and reasoning capabilities, steering their behavior toward desired objectives remains challenging because their training provides no explicit supervision. Existing alignment techniques, such as reinforcement learning from human feedback (RLHF), first train a reward model and then apply reinforcement learning to align the LM with human preferences. However, RLHF is often computationally intensive, unstable, and sensitive to hyperparameters.
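For background on the DPO family named in the title, the following is a minimal sketch of the standard pairwise DPO loss that this line of work builds on; it is not the paper's lambda-weighted listwise objective, which the abstract excerpt does not define. All function and tensor names are illustrative, and inputs are assumed to be precomputed sequence log-probabilities under the policy and a frozen reference model.

import torch
import torch.nn.functional as F

def pairwise_dpo_loss(policy_chosen_logps: torch.Tensor,
                      policy_rejected_logps: torch.Tensor,
                      ref_chosen_logps: torch.Tensor,
                      ref_rejected_logps: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Standard pairwise DPO: increase the policy's implicit reward margin
    between the preferred and dispreferred response relative to the
    reference model, without training a separate reward model or running RL."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin), averaged over the batch
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()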
