ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2502.01456
  4. Cited By
Process Reinforcement through Implicit Rewards
v1v2 (latest)

Process Reinforcement through Implicit Rewards

3 February 2025
Ganqu Cui
Lifan Yuan
Liang Luo
Hanbin Wang
Wendi Li
Bingxiang He
Wendi Li
Tianyu Yu
Qixin Xu
Weize Chen
Qixin Xu
Huayu Chen
Kaiyan Zhang
Xingtai Lv
Kaiyan Zhang
Xingtai Lv
Xu Han
Yuan Yao
Yu Cheng
Zhiyuan Liu
Maosong Sun
Zhiyuan Liu
Ning Ding
Bowen Zhou
Ning Ding
    OffRLLRM
ArXiv (abs)PDFHTMLHuggingFace (62 upvotes)

Papers citing "Process Reinforcement through Implicit Rewards"

11 / 161 papers shown
Title
Thinking Machines: A Survey of LLM based Reasoning Strategies
Dibyanayan Bandyopadhyay
Soham Bhattacharjee
Asif Ekbal
LRMELM
185
21
0
13 Mar 2025
ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning
Bo Liu
Yunxiang Li
Yangqiu Song
Hanjing Wang
Linyi Yang
...
Jun Wang
Jun Wang
Weinan Zhang
Shuyue Hu
Ying Wen
LLMAGKELMLRMAI4CE
411
33
0
12 Mar 2025
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training
Tong Wei
Yijun Yang
Junliang Xing
Yuanchun Shi
Zongqing Lu
Deheng Ye
OffRLLRM
204
6
0
11 Mar 2025
Optimizing Test-Time Compute via Meta Reinforcement Fine-Tuning
Yuxiao Qu
Matthew Y. R. Yang
Amrith Rajagopal Setlur
Lewis Tunstall
E. Beeching
Ruslan Salakhutdinov
Aviral Kumar
OffRL
326
81
0
10 Mar 2025
Soft Policy Optimization: Online Off-Policy RL for Sequence Models
Taco Cohen
David W. Zhang
Kunhao Zheng
Yunhao Tang
Rémi Munos
Gabriel Synnaeve
OffRL
186
5
0
07 Mar 2025
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Kanishk Gandhi
Ayush Chakravarthy
Anikait Singh
Nathan Lile
Noah D. Goodman
ReLMLRM
418
266
0
03 Mar 2025
Self-rewarding correction for mathematical reasoning
Self-rewarding correction for mathematical reasoning
Wei Xiong
Hanning Zhang
Chenlu Ye
Lichang Chen
Nan Jiang
Tong Zhang
ReLMKELMLRM
364
36
0
26 Feb 2025
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning
Wenkai Yang
Shuming Ma
Yankai Lin
Furu Wei
LRM
385
85
0
25 Feb 2025
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning
S2^22R: Teaching LLMs to Self-verify and Self-correct via Reinforcement LearningAnnual Meeting of the Association for Computational Linguistics (ACL), 2025
Ruotian Ma
Peisong Wang
Cheng Liu
Xingyan Liu
Jiaqi Chen
Bang Zhang
Xin Zhou
Nan Du
Jia Li
LRM
362
8
0
18 Feb 2025
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team
Angang Du
Bofei Gao
Bowei Xing
Changjiu Jiang
...
Zihao Huang
Ziyao Xu
Zhiyong Yang
Zonghan Yang
Zongyu Lin
OffRLALMAI4TSVLMLRM
834
655
0
22 Jan 2025
DPO Meets PPO: Reinforced Token Optimization for RLHF
DPO Meets PPO: Reinforced Token Optimization for RLHF
Han Zhong
Zikang Shan
Guhao Feng
Wei Xiong
Xinle Cheng
Li Zhao
Di He
Jiang Bian
Liwei Wang
533
96
0
29 Apr 2024
Previous
1234