
Multi-Preference Lambda-weighted Listwise DPO for Dynamic Preference Alignment

13 Figures
2 Tables
Appendix: 12 Pages
Abstract

While large language models (LLMs) excel at text generation, aligning them with human preferences remains challenging. Reinforcement learning from human feedback (RLHF) improves alignment but is costly and unstable. Direct Preference Optimization (DPO) offers a simpler alternative, yet assumes a fixed, single-dimensional preference. We propose Multi-Preference Lambda-weighted Listwise DPO, a generalization of DPO that supports multiple preference dimensions and dynamic interpolation via a simplex-weighted lambda vector. Our method enables listwise supervision and flexible alignment without re-training. We conduct our experiments on 1B-2B scale models by intentional choice: smaller models provide a more stringent testbed in which performance gains more clearly reflect the effectiveness of the alignment strategy itself. Moreover, such models are widely used in compute-constrained applications, making our improvements both methodologically meaningful and practically valuable. Empirical results show that our approach matches or surpasses standard DPO on alignment benchmarks while offering improved adaptability.
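
The abstract does not give the exact loss, so the following is only a minimal sketch of what a lambda-weighted listwise DPO objective could look like: per-dimension listwise target distributions over K candidate responses are mixed by a simplex-weighted lambda vector and matched against the model's DPO-style implicit-reward distribution. The function name `lambda_listwise_dpo_loss`, the tensor layout, and the softmax-based construction of targets are illustrative assumptions, not the paper's definitive formulation.

```python
import torch
import torch.nn.functional as F


def lambda_listwise_dpo_loss(policy_logps, ref_logps, dim_scores, lam, beta=0.1):
    """Sketch of a lambda-weighted listwise DPO loss (assumed formulation).

    policy_logps: (B, K) log-probs of K candidate responses under the policy
    ref_logps:    (B, K) log-probs of the same responses under the frozen reference
    dim_scores:   (B, D, K) preference scores for D preference dimensions
    lam:          (D,) simplex weights (non-negative, summing to 1)
    beta:         temperature on the implicit reward, as in standard DPO
    """
    # Implicit rewards from the DPO reparameterization: beta * log(pi / pi_ref)
    implicit_rewards = beta * (policy_logps - ref_logps)       # (B, K)
    model_log_dist = F.log_softmax(implicit_rewards, dim=-1)   # listwise model distribution

    # Per-dimension listwise target distributions over the K candidates,
    # interpolated with the simplex-weighted lambda vector.
    target_per_dim = F.softmax(dim_scores, dim=-1)             # (B, D, K)
    target = torch.einsum("d,bdk->bk", lam, target_per_dim)    # (B, K)

    # Listwise cross-entropy between the lambda-mixed target and the
    # model's implicit-reward distribution.
    return -(target * model_log_dist).sum(dim=-1).mean()


# Usage example with two hypothetical preference dimensions
# (e.g., helpfulness vs. harmlessness) and random placeholder tensors.
B, D, K = 4, 2, 5
lam = torch.tensor([0.7, 0.3])  # a point on the preference simplex
loss = lambda_listwise_dpo_loss(
    torch.randn(B, K), torch.randn(B, K), torch.randn(B, D, K), lam
)
```

Under this reading, varying `lam` changes which preference dimensions dominate the target ranking, which is one plausible way to realize the paper's claim of dynamic interpolation across preferences.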
