Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Main: 8 pages · Appendix: 3 pages · Bibliography: 3 pages · 5 figures · 6 tables
Abstract

Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches select high-quality pairwise preference data through outcome-based criteria such as answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this limitation, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at this https URL.
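To make the selection pipeline concrete, the sketch below shows one plausible way the two signals named in the abstract could be combined when building preference pairs: score each sampled response token-by-token under the model, reduce those token probabilities to a consistency score, and rank candidates by (correctness, consistency). This is a minimal illustration, not the paper's method: all function names are hypothetical, and the mean token log-probability stands in for the paper's cross-response consistency metric, which the abstract does not define.

```python
import torch

def token_logprobs(model, tokenizer, prompt, response):
    """Per-token log-probabilities the model assigns to `response` given
    `prompt` (teacher-forcing scoring with a HuggingFace causal LM)."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Shift: logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    tok_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the response tokens (assumes the prompt tokenization is a
    # prefix of the full tokenization, which holds for most tokenizers).
    return tok_lp[0, prompt_len - 1:]

def consistency_score(logprobs):
    """Stand-in consistency signal: mean token log-probability. The
    paper's actual token-level probability-consistency metric across
    responses is not specified in the abstract."""
    return logprobs.mean().item()

def select_preference_pair(prompt, candidates, is_correct, model, tokenizer):
    """Rank sampled responses by (answer correctness, consistency) and
    return a (chosen, rejected) pair for preference optimization."""
    scored = []
    for resp in candidates:
        lp = token_logprobs(model, tokenizer, prompt, resp)
        scored.append((is_correct(resp), consistency_score(lp), resp))
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return scored[0][2], scored[-1][2]  # best vs. worst candidate
```

The resulting (chosen, rejected) pairs would then feed a standard pairwise preference objective such as DPO; the point of the sketch is only that the ranking uses both a surface-level correctness check and an intrinsic probability-based score, rather than answer correctness alone.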
