ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning

3 July 2025
Ruiyang Zhou, Shuozhe Li, Amy Zhang, Liu Leqi
Main: 11 pages · Appendix: 5 pages · Bibliography: 3 pages · 9 figures · 5 tables
Abstract

Recent advances in large language models have been driven by reinforcement learning (RL)-style post-training, which improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. However, these methods depend heavily on the model's initial ability to produce positive samples. They primarily refine what the model already knows (distribution sharpening) rather than enabling the model to solve problems where it initially fails. This limitation is especially problematic in early-stage RL training and on challenging reasoning tasks, where positive samples are unlikely to be generated. To unlock reasoning ability in such settings, the model must explore new reasoning trajectories beyond its current output distribution. Such exploration requires access to sufficiently good positive samples to guide the learning. While expert demonstrations seem like a natural solution, we find that they are often ineffective in RL post-training. Instead, we identify two key properties of effective positive samples: they should (1) be likely under the current policy, and (2) increase the model's likelihood of predicting the correct answer. Based on these insights, we propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer. ExPO enables efficient exploration and guides the model to produce reasoning trajectories more aligned with its policy than expert-written CoTs, while ensuring higher quality than its own (incorrect) samples. Experiments show that ExPO improves both learning efficiency and final performance on reasoning benchmarks, surpassing expert-demonstration-based methods in challenging settings such as MATH level-5, where the model initially struggles the most.
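The core idea described in the abstract is to obtain positive samples by conditioning generation on the ground-truth answer when the policy's own rollouts all fail. The sketch below is a minimal illustration of that idea only, not the authors' implementation: the policy interface, the prompt wording, the string-match verify reward, and the collect_expo_batch helper are all assumptions made for the example.

# Hypothetical sketch of ExPO-style self-explanation sampling (illustrative, not the paper's code).
import random
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Sample:
    prompt: str       # question shown to the policy during the RL update
    response: str     # chain-of-thought plus final answer, to be scored
    reward: float     # outcome-based reward from the verifier

def verify(response: str, answer: str) -> float:
    """Toy outcome-based verifier: reward 1.0 iff the final answer matches."""
    return 1.0 if response.strip().endswith(answer.strip()) else 0.0

def collect_expo_batch(policy: Callable[[str], str],
                       question: str,
                       answer: str,
                       n_rollouts: int = 8) -> List[Sample]:
    """Collect GRPO-style rollouts; if none is correct, add a self-explanation."""
    batch = [Sample(question, policy(question), 0.0) for _ in range(n_rollouts)]
    for s in batch:
        s.reward = verify(s.response, answer)

    # Key ExPO idea: when the policy produces no positive sample on its own,
    # condition generation on the ground-truth answer so the model "explains"
    # how to reach it.  The trajectory is sampled from the current policy
    # (so it stays likely under it) while raising the answer's likelihood.
    if all(s.reward == 0.0 for s in batch):
        guided_prompt = (f"{question}\n"
                         f"The correct answer is {answer}. "
                         f"Explain step by step how to arrive at it.")
        explanation = policy(guided_prompt)
        # Pair the original question (not the guided prompt) with the
        # self-explanation and a positive reward for the RL update.
        batch.append(Sample(question, f"{explanation}\n{answer}", 1.0))
    return batch

if __name__ == "__main__":
    # Toy stand-in policy so the sketch runs end to end.
    toy_policy = lambda prompt: random.choice(["... so the result is 12", "... giving 7"])
    for s in collect_expo_batch(toy_policy, "What is 3 + 4?", "7"):
        print(f"reward={s.reward:.0f}  response={s.response!r}")

In practice, samples collected this way would feed the usual GRPO-style policy update alongside the model's own rollouts; the sketch stops at data collection.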

@article{zhou2025_2507.02834,
  title={ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning},
  author={Ruiyang Zhou and Shuozhe Li and Amy Zhang and Liu Leqi},
  journal={arXiv preprint arXiv:2507.02834},
  year={2025}
}