Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment

13 Figures
2 Tables
Appendix: 12 Pages
Abstract

While large-scale unsupervised language models (LMs) capture broad world knowledge and reasoning capabilities, steering their behavior toward desired objectives remains challenging due to the lack of explicit supervision. Existing alignment techniques, such as reinforcement learning from human feedback (RLHF), first train a reward model and then fine-tune the LM with reinforcement learning to align it with human preferences. However, RLHF is often computationally intensive, unstable, and sensitive to hyperparameters.
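For context, the title refers to DPO (Direct Preference Optimization), which replaces the reward-model-plus-RL pipeline described above with a direct supervised objective on preference pairs. The sketch below shows only the standard pairwise DPO loss that the paper's lambda-weighted listwise variant builds on; the listwise extension itself is not spelled out in the abstract shown here, and the function and tensor names are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def pairwise_dpo_loss(policy_chosen_logps, policy_rejected_logps,
                      ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard pairwise DPO loss (Rafailov et al., 2023).

    Each argument is a tensor of per-sequence log-probabilities
    (summed over tokens) for the chosen / rejected responses under
    the trainable policy and the frozen reference model.
    """
    # Implicit rewards: beta * (log pi_theta - log pi_ref) for each response
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the log-sigmoid of the reward margin between chosen and rejected
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

Because the loss depends only on log-probability ratios against a frozen reference model, it avoids the separate reward model and the RL loop that make RLHF costly and unstable, which is the motivation stated in the abstract.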
