
Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

Main: 10 pages, 5 figures; Bibliography: 2 pages; Appendix: 4 pages
Abstract

Group Relative Policy Optimization (GRPO) was introduced recently and used successfully to train the DeepSeek-R1 models, promoting the reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback--Leibler (KL) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_n$ can be expressed explicitly in terms of the binary reward, as well as the first- and second-order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_{\text{ref}}$. Iterating this scheme, we obtain a sequence of policies $\pi_n$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function depending on the initial probability of success $p_{\text{ref}}$ and the regularization parameter $\beta$ of the KL regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_{\text{ref}}$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
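The fixed-point behavior described above can be illustrated numerically. The following is a minimal Python sketch, not the paper's exact derivation: it assumes the KL-regularized optimum exponentially tilts the reference success/failure mass by the standardized binary-reward advantage $A = (r - p)/\sqrt{p(1-p)}$, and iterates the resulting map $p_n = h(p_{n-1})$; the precise recurrence in the paper may include additional regularization terms.

import math

def h(p, p_ref, beta, eps=1e-8):
    """Hypothetical success-probability update map (illustrative assumption).

    Tilts the reference success/failure mass by exp(A / beta), where
    A = (r - p) / sqrt(p * (1 - p)) is the standardized binary-reward advantage.
    """
    p = min(max(p, eps), 1.0 - eps)
    std = math.sqrt(p * (1.0 - p))
    w_succ = p_ref * math.exp((1.0 - p) / (beta * std))   # tilted success mass (r = 1)
    w_fail = (1.0 - p_ref) * math.exp(-p / (beta * std))  # tilted failure mass (r = 0)
    return w_succ / (w_succ + w_fail)

def iterate(p_ref, beta, n_iters=50):
    """Iterate p_n = h(p_{n-1}) starting from p_0 = p_ref."""
    p = p_ref
    for _ in range(n_iters):
        p = h(p, p_ref, beta)
    return p

if __name__ == "__main__":
    p_ref, beta = 0.2, 1.0
    p_star = iterate(p_ref, beta)
    # The fixed point exceeds p_ref, illustrating success amplification.
    print(f"p_ref = {p_ref:.3f} -> p* = {p_star:.3f}")

With $p_{\text{ref}} = 0.2$ and $\beta = 1$, this assumed map settles around $p^* \approx 0.7 > p_{\text{ref}}$, consistent with the amplification statement in the abstract.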

@article{mroueh2025_2503.06639,
  title={Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification},
  author={Youssef Mroueh},
  journal={arXiv preprint arXiv:2503.06639},
  year={2025}
}