2503.06639
Cited By
v3 (latest)
Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification
9 March 2025
Youssef Mroueh
OffRL
Papers citing
"Reinforcement Learning with Verifiable Rewards: GRPO's Effective Loss, Dynamics, and Success Amplification"
19 / 19 papers shown
Title
From Emergence to Control: Probing and Modulating Self-Reflection in Language Models
Xudong Zhu
Jiachen Jiang
Mohammad Mahdi Khalili
Zhihui Zhu
ReLM
LM&Ro
LRM
39
0
0
13 Jun 2025
Improving Data Efficiency for LLM Reinforcement Fine-tuning Through Difficulty-targeted Online Data Selection and Rollout Replay
Yifan Sun
Jingyan Shen
Yibin Wang
Tianyu Chen
Zhendong Wang
Mingyuan Zhou
Huan Zhang
85
0
0
05 Jun 2025
SuperRL: Reinforcement Learning with Supervision to Boost Language Model Reasoning
Yihao Liu
Shuocheng Li
Lang Cao
Yuhang Xie
Mengyu Zhou
Haoyu Dong
Xiaojun Ma
Shi Han
Dongmei Zhang
OffRL
ReLM
LRM
32
0
0
01 Jun 2025
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning
Yiqing Liang
Jielin Qiu
Wenhao Ding
Zuxin Liu
James Tompkin
Mengdi Xu
Mengzhou Xia
Zhengzhong Tu
Laixi Shi
Jiacheng Zhu
OffRL
125
0
0
30 May 2025
Revisiting Group Relative Policy Optimization: Insights into On-Policy and Off-Policy Training
Youssef Mroueh
Nicolas Dupuis
Brian M. Belgodere
Apoorva Nitsure
Mattia Rigotti
Kristjan Greenewald
Jirí Navrátil
Jerret Ross
Jesus Rios
OffRL
88
0
0
28 May 2025
DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization
Gang Li
Ming Lin
Tomer Galanti
Zhengzhong Tu
Tianbao Yang
93
1
0
18 May 2025
VeriReason: Reinforcement Learning with Testbench Feedback for Reasoning-Enhanced Verilog Generation
Yiting Wang
Guoheng Sun
Wanghao Ye
Gang Qu
Ang Li
OffRL
3DV
LRM
VLM
82
0
0
17 May 2025
Search and Refine During Think: Autonomous Retrieval-Augmented Reasoning of LLMs
Yaorui Shi
Shihan Li
Chang Wu
Zhiyuan Liu
Sihang Li
Hengxing Cai
An Zhang
Xiang Wang
ReLM
LRM
162
0
0
16 May 2025
MultiClear: Multimodal Soft Exoskeleton Glove for Transparent Object Grasping Assistance
Chen Hu
Timothy Neate
Shan Luo
Letizia Gionfrida
101
12
0
04 Apr 2025
Measurement of LLM's Philosophies of Human Nature
Minheng Ni
Ennan Wu
Zidong Gong
Zhiyong Yang
Linjie Li
Chung-Ching Lin
Kevin Qinghong Lin
Lijuan Wang
Wangmeng Zuo
134
0
0
03 Apr 2025
What is the Alignment Objective of GRPO?
Milan Vojnovic
Se-Young Yun
138
5
0
25 Feb 2025
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
DeepSeek-AI
Daya Guo
Dejian Yang
Haowei Zhang
Junxiao Song
...
Shiyu Wang
S. Yu
Shunfeng Zhou
Shuting Pan
S.S. Li
ReLM
VLM
OffRL
AI4TS
LRM
390
2,024
0
22 Jan 2025
Information Theoretic Guarantees For Policy Alignment In Large Language Models
Youssef Mroueh
97
8
0
09 Jun 2024
The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization
Shengyi Huang
Michael Noukhovitch
Arian Hosseini
Kashif Rasul
Weixun Wang
Lewis Tunstall
VLM
110
38
0
24 Mar 2024
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao
Peiyi Wang
Qihao Zhu
Runxin Xu
Jun-Mei Song
...
Haowei Zhang
Mingchuan Zhang
Yiming Li
Yu-Huan Wu
Daya Guo
ReLM
LRM
197
1,288
0
05 Feb 2024
Efficient Memory Management for Large Language Model Serving with PagedAttention
Woosuk Kwon
Zhuohan Li
Siyuan Zhuang
Ying Sheng
Lianmin Zheng
Cody Hao Yu
Joseph E. Gonzalez
Haotong Zhang
Ion Stoica
VLM
206
2,338
0
12 Sep 2023
Scaling Laws for Reward Model Overoptimization
Leo Gao
John Schulman
Jacob Hilton
ALM
131
569
0
19 Oct 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe
V. Kosaraju
Mohammad Bavarian
Mark Chen
Heewoo Jun
...
Jerry Tworek
Jacob Hilton
Reiichiro Nakano
Christopher Hesse
John Schulman
ReLM
OffRL
LRM
419
4,606
0
27 Oct 2021
Proximal Policy Optimization Algorithms
John Schulman
Filip Wolski
Prafulla Dhariwal
Alec Radford
Oleg Klimov
OffRL
678
19,343
0
20 Jul 2017