
Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification

Main: 10 pages, 5 figures; Bibliography: 2 pages; Appendix: 4 pages
Abstract

Group Relative Policy Optimization (GRPO) was introduced recently and used successfully to train the DeepSeek-R1 models, promoting the reasoning capabilities of LLMs using verifiable or binary rewards. We show in this paper that GRPO with verifiable rewards can be written as a Kullback--Leibler (KL) regularized contrastive loss, where the contrastive samples are synthetic data sampled from the old policy. The optimal GRPO policy $\pi_n$ can be expressed explicitly in terms of the binary reward, as well as the first- and second-order statistics of the old policy ($\pi_{n-1}$) and the reference policy $\pi_{\text{ref}}$. Iterating this scheme, we obtain a sequence of policies $\pi_n$ for which we can quantify the probability of success $p_n$. We show that the probability of success of the policy satisfies a recurrence that converges to a fixed point of a function depending on the initial probability of success $p_{\text{ref}}$ and the regularization parameter $\beta$ of the KL regularizer. We show that the fixed point $p^*$ is guaranteed to be larger than $p_{\text{ref}}$, thereby demonstrating that GRPO effectively amplifies the probability of success of the policy.
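The fixed-point behavior described above can be illustrated numerically. The following is a minimal Python sketch, not the paper's exact derivation: it assumes the KL-regularized optimum exponentially tilts the reference success/failure mass by the standardized binary-reward advantage $A = (r - p)/\sqrt{p(1-p)}$, and iterates the resulting map $p_n = h(p_{n-1})$; the precise recurrence in the paper may include additional regularization terms.

import math

def h(p, p_ref, beta, eps=1e-8):
    """Hypothetical success-probability update map (illustrative assumption).

    Tilts the reference success/failure mass by exp(A / beta), where
    A = (r - p) / sqrt(p * (1 - p)) is the standardized binary-reward advantage.
    """
    p = min(max(p, eps), 1.0 - eps)
    std = math.sqrt(p * (1.0 - p))
    w_succ = p_ref * math.exp((1.0 - p) / (beta * std))   # tilted success mass (r = 1)
    w_fail = (1.0 - p_ref) * math.exp(-p / (beta * std))  # tilted failure mass (r = 0)
    return w_succ / (w_succ + w_fail)

def iterate(p_ref, beta, n_iters=50):
    """Iterate p_n = h(p_{n-1}) starting from p_0 = p_ref."""
    p = p_ref
    for _ in range(n_iters):
        p = h(p, p_ref, beta)
    return p

if __name__ == "__main__":
    p_ref, beta = 0.2, 1.0
    p_star = iterate(p_ref, beta)
    # The fixed point exceeds p_ref, illustrating success amplification.
    print(f"p_ref = {p_ref:.3f} -> p* = {p_star:.3f}")

With $p_{\text{ref}} = 0.2$ and $\beta = 1$, this assumed map settles around $p^* \approx 0.7 > p_{\text{ref}}$, consistent with the amplification statement in the abstract.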

@article{mroueh2025_2503.06639,
  title={Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification},
  author={Youssef Mroueh},
  journal={arXiv preprint arXiv:2503.06639},
  year={2025}
}