Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

9 March 2025
Youssef Mroueh
Main: 10 pages · 5 figures · Bibliography: 2 pages · Appendix: 4 pages
Abstract

Group Relative Policy Optimization (GRPO) was introduced and used successfully to train the DeepSeek R1 models, promoting the reasoning capabilities of LLMs with verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback-Leibler ($\mathsf{KL}$) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_n$ can be expressed explicitly in terms of the binary reward, as well as the first and second order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_0$. Iterating this scheme, we obtain a sequence of policies $\pi_n$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function that depends on the initial probability of success $p_0$ and the regularization parameter $\beta$ of the $\mathsf{KL}$ regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_0$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
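The abstract states that the success probability evolves as a recurrence $p_n = h_{\beta, p_0}(p_{n-1})$ whose fixed point $p^*$ exceeds $p_0$. The sketch below iterates one plausible instance of such a map, assuming an exponential-tilting form consistent with the abstract's description of the optimal policy (the reference policy reweighted by the group-whitened binary reward, scaled by $1/\beta$); the exact map derived in the paper may differ, so the functions `grpo_success_update` and `iterate_to_fixed_point` are illustrative, not the paper's definitions.

```python
import numpy as np


def grpo_success_update(p_prev: float, p0: float, beta: float) -> float:
    """Illustrative success-probability recurrence for GRPO with a binary reward.

    Assumption (not taken verbatim from the paper): the optimal policy tilts the
    reference policy pi_0 by exp(A / beta), where A = (r - p_prev) / std is the
    whitened advantage of the binary reward r in {0, 1} under the old policy.
    Summing the tilted reference mass over successful vs. failed outputs yields
    a scalar recurrence in the success probability alone.
    """
    std = np.sqrt(p_prev * (1.0 - p_prev))
    # Whitened advantages of a success (r = 1) and a failure (r = 0).
    adv_success = (1.0 - p_prev) / std
    adv_failure = (0.0 - p_prev) / std
    # Reference-policy mass on successes (p0) and failures (1 - p0), tilted.
    w_success = p0 * np.exp(adv_success / beta)
    w_failure = (1.0 - p0) * np.exp(adv_failure / beta)
    return w_success / (w_success + w_failure)


def iterate_to_fixed_point(p0: float, beta: float, n_steps: int = 100) -> float:
    """Iterate p_n = h(p_{n-1}) starting from p_0 to approximate the fixed point."""
    p = p0
    for _ in range(n_steps):
        p = grpo_success_update(p, p0, beta)
    return p


if __name__ == "__main__":
    p0, beta = 0.2, 1.0
    p_star = iterate_to_fixed_point(p0, beta)
    # Success amplification: the approximate fixed point should exceed p0.
    print(f"p0 = {p0:.3f}, approximate fixed point p* = {p_star:.3f}")
```

Under this illustrative map, starting from $p_0 = 0.2$ with $\beta = 1$ the iterates settle well above $0.2$, matching the qualitative claim that the fixed point $p^*$ amplifies the initial probability of success; smaller $\beta$ (weaker $\mathsf{KL}$ regularization) pushes the fixed point higher.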

@article{mroueh2025_2503.06639,
  title={Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification},
  author={Youssef Mroueh},
  journal={arXiv preprint arXiv:2503.06639},
  year={2025}
}