Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models

Aligning language models with human preferences is critical for real-world deployment, but existing methods often require large amounts of high-quality human annotations. Aiming at data-efficient alignment, we propose Stackelberg Game Preference Optimization (SGPO), a framework that models alignment as a two-player Stackelberg game, where a policy (leader) optimizes against a worst-case preference distribution (follower) within an ε-Wasserstein ball, ensuring robustness to (self-)annotation noise and distribution shifts. SGPO guarantees O(ε)-bounded regret, unlike Direct Preference Optimization (DPO), whose regret grows linearly in the distribution mismatch. We instantiate SGPO with the Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm, which iteratively self-annotates preferences and adversarially reweights the synthetic annotations. Using only 2K seed preferences from the UltraFeedback dataset, i.e., 1/30 of the human labels in the dataset, our method achieves a 35.82% GPT-4 win rate with Mistral-7B and 40.12% with Llama3-8B-Instruct within three rounds of SSAPO.
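As a rough illustration of the Stackelberg formulation described in the abstract, one can read the leader-follower structure as a max-min problem: the policy (leader) maximizes a preference objective while the follower picks the worst-case preference distribution inside an ε-Wasserstein ball around the empirical (self-annotated) distribution. The sketch below is an assumption-laden reading of the abstract, not the paper's exact objective; in particular, the DPO-style per-sample loss, the reference policy π_ref, and the temperature β are assumed for concreteness.

% Hypothetical sketch of the Stackelberg objective (notation assumed):
% leader = policy \pi_\theta, follower = preference distribution P
% constrained to an \epsilon-Wasserstein ball around the empirical
% self-annotated distribution \hat{P}.
\max_{\theta} \;
\min_{P :\, W(P, \hat{P}) \le \epsilon} \;
\mathbb{E}_{(x, y_w, y_l) \sim P}
\left[
  \log \sigma\!\left(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right)
\right]

Under this reading, setting ε = 0 recovers optimization against the empirical preference distribution alone, which is consistent with the abstract's contrast between SGPO's O(ε)-bounded regret and DPO's sensitivity to distribution mismatch.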
@article{chu2025_2502.18099,
  title={Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models},
  author={Xu Chu and Zhixin Zhang and Tianyu Jia and Yujie Jin},
  journal={arXiv preprint arXiv:2502.18099},
  year={2025}
}