
Preference Optimization with Multi-Sample Comparisons

Abstract

Recent advancements in generative models, particularly large language models (LLMs) and diffusion models, have been driven by extensive pretraining on large datasets followed by post-training. However, current post-training methods such as reinforcement learning from human feedback (RLHF) and direct alignment from preference (DAP) methods primarily rely on single-sample comparisons. These approaches often fail to capture critical characteristics such as generative diversity and bias, which are more accurately assessed across multiple samples. To address these limitations, we introduce a novel approach that extends post-training to include multi-sample comparisons. To achieve this, we propose Multi-sample Direct Preference Optimization (mDPO) and Multi-sample Identity Preference Optimization (mIPO). These methods improve traditional DAP methods by focusing on group-wise characteristics. Empirically, we demonstrate that multi-sample comparison is more effective than single-sample comparison in optimizing collective characteristics (e.g., diversity and bias) of generative models. Additionally, our findings suggest that multi-sample comparisons provide a more robust optimization framework, particularly for datasets with label noise.
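As a rough illustration of the idea described above, the sketch below shows how a DPO-style objective might be extended from single-sample to multi-sample comparisons by scoring a group of k preferred samples against a group of k dispreferred samples. The function name, tensor shapes, and the group-averaging rule are assumptions for exposition, not the paper's exact mDPO formulation.

```python
import torch
import torch.nn.functional as F

def multi_sample_dpo_loss(policy_logps_w, policy_logps_l,
                          ref_logps_w, ref_logps_l, beta=0.1):
    """Hypothetical multi-sample DPO-style loss (sketch).

    Each tensor has shape (batch, k): log-probabilities of k samples in
    the preferred (w) or dispreferred (l) group under the policy or the
    frozen reference model. The single-sample DPO log-ratio is replaced
    by a group-averaged log-ratio; this averaging rule is assumed here
    for illustration.
    """
    # Group-level implicit rewards: average log-ratio over the k samples.
    ratio_w = (policy_logps_w - ref_logps_w).mean(dim=-1)
    ratio_l = (policy_logps_l - ref_logps_l).mean(dim=-1)
    # Standard Bradley-Terry / DPO objective applied to the group margin.
    return -F.logsigmoid(beta * (ratio_w - ratio_l)).mean()
```

With k = 1 this reduces to the usual single-sample DPO loss; larger k lets the objective reward group-wise properties such as diversity across the preferred set.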

@article{wang2025_2410.12138,
  title={Preference Optimization with Multi-Sample Comparisons},
  author={Chaoqi Wang and Zhuokai Zhao and Chen Zhu and Karthik Abinav Sankararaman and Michal Valko and Xuefei Cao and Zhaorun Chen and Madian Khabsa and Yuxin Chen and Hao Ma and Sinong Wang},
  journal={arXiv preprint arXiv:2410.12138},
  year={2025}
}