
Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models

Abstract

Aligning language models with human preferences is critical for real-world deployment, but existing methods often require large amounts of high-quality human annotations. Toward data-efficient alignment, we propose Stackelberg Game Preference Optimization (SGPO), a framework that models alignment as a two-player Stackelberg game in which a policy (leader) optimizes against a worst-case preference distribution (follower) within an ϵ-Wasserstein ball, ensuring robustness to (self-)annotation noise and distribution shifts. SGPO guarantees O(ϵ)-bounded regret, unlike Direct Preference Optimization (DPO), whose regret grows linearly with the distribution mismatch. We instantiate SGPO with the Stackelberg Self-Annotated Preference Optimization (SSAPO) algorithm, which iteratively self-annotates preferences and adversarially reweights the resulting synthetic annotations. Using only 2K seed preferences from the UltraFeedback dataset, i.e., 1/30 of the human labels in the dataset, our method achieves a 35.82% GPT-4 win rate with Mistral-7B and 40.12% with Llama3-8B-Instruct within three rounds of SSAPO.
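
To make the leader-follower structure concrete, below is a minimal numerical sketch (not the paper's implementation) of the adversarial reweighting step. Assumptions not taken from the abstract: the ϵ-Wasserstein ball is approximated by a simple Euclidean budget on the weights' deviation from uniform, the per-pair preference losses are given as a plain array, and the function name worst_case_weights is hypothetical.

# Sketch of the follower (worst-case reweighting) step in an SSAPO-style loop.
# The Wasserstein constraint is replaced by a Euclidean deviation budget for simplicity.
import numpy as np

def worst_case_weights(losses, eps, n_steps=200, lr=0.05):
    """Find simplex weights that maximize the weighted loss while staying
    within an eps-budget of the uniform distribution (proxy for the ball)."""
    n = len(losses)
    w = np.full(n, 1.0 / n)
    for _ in range(n_steps):
        w = w + lr * losses                    # ascend on the weighted loss
        w = np.clip(w, 0.0, None)
        w /= w.sum()                           # project back onto the simplex
        dev = np.linalg.norm(w - 1.0 / n)      # pull back toward uniform if over budget
        if dev > eps:
            w = 1.0 / n + (w - 1.0 / n) * (eps / dev)
    return w

# Leader step (sketch): the policy would minimize this reweighted preference loss.
losses = np.array([0.3, 1.2, 0.7, 2.1])        # per-pair losses from self-annotated data
w = worst_case_weights(losses, eps=0.1)
robust_loss = float(np.dot(w, losses))
print(w, robust_loss)

In the full algorithm, the per-pair losses would come from a DPO-style objective on self-annotated preference pairs, and the leader update and follower reweighting alternate over several rounds.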

@article{chu2025_2502.18099,
  title={Stackelberg Game Preference Optimization for Data-Efficient Alignment of Language Models},
  author={Xu Chu and Zhixin Zhang and Tianyu Jia and Yujie Jin},
  journal={arXiv preprint arXiv:2502.18099},
  year={2025}
}