
GRPO is Secretly a Process Reward Model

25 September 2025

Michael Sullivan

Main: 8 pages · 7 figures · Bibliography: 3 pages · 2 tables · Appendix: 4 pages

Abstract

We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
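
To make the idea of within-group overlap concrete, the following is a minimal illustrative sketch, not the paper's construction. It computes the standard group-normalized GRPO advantage for each completion in a group and then averages those advantages over the completions that share a given token prefix. Treating a shared prefix as a "process step" and scoring it by the mean advantage are assumptions made here for illustration; the helper names `group_advantages` and `prefix_value`, the toy token IDs, and the use of the population standard deviation are likewise hypothetical choices. The $\lambda$-GRPO modification is not shown, since the abstract does not specify it.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps).

    Uses the population standard deviation; implementations differ on
    this detail, so treat it as an assumption.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def prefix_value(completions, advantages, prefix):
    """Mean advantage over the completions that start with `prefix`.

    Hypothetical helper: it illustrates how tokens shared by several
    completions in a group receive a pooled, step-like signal.
    Returns None if no completion starts with the prefix.
    """
    prefix = list(prefix)
    members = [
        adv
        for tokens, adv in zip(completions, advantages)
        if tokens[: len(prefix)] == prefix
    ]
    return mean(members) if members else None


# Toy group of four completions (token IDs) with outcome rewards.
completions = [[1, 2, 3], [1, 2, 4], [1, 5, 6], [7, 8, 9]]
rewards = [1.0, 1.0, 0.0, 0.0]

advantages = group_advantages(rewards)
shared_by_two = prefix_value(completions, advantages, [1, 2])
shared_by_three = prefix_value(completions, advantages, [1])
unshared = prefix_value(completions, advantages, [7])
```

In this toy group, the prefix `[1, 2]` is shared only by the two rewarded completions, while `[1]` is also shared by an unrewarded one, so the pooled signal for `[1]` is weaker than for `[1, 2]`. Under the assumptions above, this is one way to picture how outcome-level rewards can translate into different effective credit for different shared sub-sequences.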
