Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision

21 May 2025
Eric Hanchen Jiang
Haozheng Luo
Shengyuan Pang
Xiaomin Li
Zhenting Qi
Hengli Li
Cheng-Fu Yang
Zongyu Lin
Xinfeng Li
Hao Xu
Kai-Wei Chang
Ying Nian Wu
Abstract

Mathematical reasoning presents a significant challenge for Large Language Models (LLMs), often requiring robust multi-step logical consistency. While Chain-of-Thought (CoT) prompting elicits reasoning steps, it does not guarantee correctness, and improving reliability via extensive sampling is computationally costly. This paper introduces the Energy Outcome Reward Model (EORM), an effective, lightweight, post-hoc verifier. EORM leverages Energy-Based Models (EBMs) to simplify the training of reward models by learning to assign a scalar energy score to CoT solutions using only outcome labels, thereby avoiding detailed annotations. It achieves this by interpreting discriminator output logits as negative energies, effectively ranking candidates so that lower energy is assigned to solutions leading to correct final outcomes, implicitly favoring coherent reasoning. On mathematical benchmarks (GSM8k, MATH), EORM significantly improves final-answer accuracy (e.g., with Llama 3 8B, achieving 90.7% on GSM8k and 63.7% on MATH). EORM effectively leverages a given pool of candidate solutions to match or exceed the performance of brute-force sampling, thereby enhancing the reliability of LLM reasoning outcomes through its streamlined post-hoc verification process.
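The core mechanic described in the abstract is straightforward to sketch: a verifier reads a candidate CoT solution, emits a scalar energy, and the sampled candidates are reranked by ascending energy, with training driven only by whether each candidate's final answer is correct. The snippet below is a minimal, hedged illustration of that idea in PyTorch; the encoder checkpoint, the EnergyReranker class, and the pairwise outcome loss are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of an energy-based reranker for CoT candidates.
# Assumptions (not from the paper): a small Transformer encoder checkpoint,
# a scalar head read as energy E(x) (so the logit plays the role of -E(x)),
# and a generic pairwise ranking loss driven only by outcome labels.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class EnergyReranker(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.energy_head = nn.Linear(hidden, 1)  # scalar energy per candidate

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]        # [CLS]-style pooled state
        return self.energy_head(pooled).squeeze(-1)  # energy E(x)

def pairwise_outcome_loss(e_correct, e_incorrect, margin: float = 0.0):
    """Push energies of correct-outcome CoTs below incorrect ones.
    With margin=0 this equals -log sigmoid((-E_correct) - (-E_incorrect))."""
    return torch.nn.functional.softplus(e_correct - e_incorrect + margin).mean()

@torch.no_grad()
def rerank(model, tokenizer, question: str, candidates: list[str], device="cpu"):
    """Score each sampled CoT candidate and return them sorted by ascending energy."""
    texts = [f"{question}\n{c}" for c in candidates]
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt").to(device)
    energies = model(batch["input_ids"], batch["attention_mask"])
    order = torch.argsort(energies)                 # lowest energy first
    return [candidates[i] for i in order.tolist()], energies[order].tolist()

At inference, one would sample several CoT solutions from the base LLM, call rerank, and return the answer of the lowest-energy candidate; this is how a post-hoc verifier can match brute-force sampling from a fixed candidate pool, though the exact training recipe and architecture here are assumptions.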

@article{jiang2025_2505.14999,
  title={Learning to Rank Chain-of-Thought: An Energy-Based Approach with Outcome Supervision},
  author={Eric Hanchen Jiang and Haozheng Luo and Shengyuan Pang and Xiaomin Li and Zhenting Qi and Hengli Li and Cheng-Fu Yang and Zongyu Lin and Xinfeng Li and Hao Xu and Kai-Wei Chang and Ying Nian Wu},
  journal={arXiv preprint arXiv:2505.14999},
  year={2025}
}