Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training

28 May 2025
Youssef Mroueh, Nicolas Dupuis, Brian M. Belgodere, Apoorva Nitsure, Mattia Rigotti, Kristjan Greenewald, Jirí Navrátil, Jerret Ross, Jesus Rios
Main: 10 pages · 4 figures · 3 tables · Bibliography: 2 pages · Appendix: 5 pages
Abstract

We revisit Group Relative Policy Optimization (GRPO) in both on-policy and off-policy optimization regimes. Our motivation comes from recent work on off-policy Proximal Policy Optimization (PPO), which improves training stability, sampling efficiency, and memory usage. In addition, a recent analysis of GRPO suggests that estimating the advantage function with off-policy samples could be beneficial. Building on these observations, we adapt GRPO to the off-policy setting. We show that both on-policy and off-policy GRPO objectives yield an improvement in the reward. This result motivates the use of clipped surrogate objectives in the off-policy version of GRPO. We then compare the empirical performance of reinforcement learning with verifiable rewards in post-training using both GRPO variants. Our results show that off-policy GRPO either significantly outperforms or performs on par with its on-policy counterpart.
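The clipped surrogate objective mentioned in the abstract follows the PPO recipe applied to group-relative advantages. The sketch below illustrates how such a loss could be computed for one group of sampled completions. It is a minimal PyTorch sketch, not the paper's implementation: the function name grpo_clipped_loss, the tensor shapes, and the clip_eps value are illustrative assumptions, and details such as token masking, length normalization, and any KL penalty are omitted. In the off-policy setting, logprobs_old would come from the stale behavior policy that generated the samples rather than the current policy snapshot.

```python
import torch

def grpo_clipped_loss(logprobs_new, logprobs_old, rewards, clip_eps=0.2):
    """Illustrative GRPO-style clipped surrogate loss for one group (not the paper's code).

    logprobs_new: (G, T) per-token log-probs of the G sampled completions
                  under the current policy.
    logprobs_old: (G, T) per-token log-probs under the policy that generated
                  the samples (a stale behavior policy in the off-policy case).
    rewards:      (G,) scalar verifiable rewards, one per completion.
    """
    # Group-relative advantage: standardize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (G,)
    adv = adv.unsqueeze(-1)                                     # broadcast over tokens

    # Per-token importance ratio between the current and the sampling policy.
    ratio = torch.exp(logprobs_new - logprobs_old)              # (G, T)

    # PPO-style clipped surrogate, averaged over tokens and the group.
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()
```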

@article{mroueh2025_2505.22257,
  title={Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training},
  author={Youssef Mroueh and Nicolas Dupuis and Brian Belgodere and Apoorva Nitsure and Mattia Rigotti and Kristjan Greenewald and Jiri Navratil and Jerret Ross and Jesus Rios},
  journal={arXiv preprint arXiv:2505.22257},
  year={2025}
}