2508.11408
Cited By
v1
v2 (latest)
On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting
15 August 2025
Wenhao Zhang
Yuexiang Xie
Yuchang Sun
Yanxi Chen
Guoyin Wang
Yaliang Li
Bolin Ding
Jingren Zhou
OffRL
Papers citing
"On-Policy RL Meets Off-Policy Experts: Harmonizing Supervised Fine-Tuning and Reinforcement Learning via Dynamic Weighting"
20 / 20 papers shown
RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization
Zeng Zhiyuan
Jiashuo Liu
Zhangyue Yin
Ge Zhang
Wenhao Huang
Xipeng Qiu
44
0
0
06 Nov 2025
Do Not Step Into the Same River Twice: Learning to Reason from Trial and Error
Chenming Tang
Hsiu-Yuan Huang
Weijie Liu
Saiyong Yang
Yunfang Wu
OffRL
LRM
64
0
0
30 Oct 2025
InfoFlow: Reinforcing Search Agent Via Reward Density Optimization
Kun Luo
Hongjin Qian
Zheng Liu
Ziyi Xia
Shitao Xiao
Siqi Bao
Jun Zhao
Kang Liu
28
0
0
30 Oct 2025
ViSurf: Visual Supervised-and-Reinforcement Fine-Tuning for Large Vision-and-Language Models
Yuqi Liu
Liangyu Chen
Jiazhen Liu
Mingkang Zhu
Zhisheng Zhong
Bei Yu
Jiaya Jia
LRM
36
0
0
12 Oct 2025
HINT: Helping Ineffective Rollouts Navigate Towards Effectiveness
X. Wang
Jinyi Han
Zishang Jiang
Tingyun Li
Jiaqing Liang
Sihang Jiang
Zhaoqian Dai
Shuguang Ma
Fei Yu
Yanghua Xiao
LRM
24
0
0
10 Oct 2025
TaoSR-AGRL: Adaptive Guided Reinforcement Learning Framework for E-commerce Search Relevance
Jianhui Yang
Yiming Jin
Pengkun Jiao
Chenhe Dong
Zerui Huang
Shaowei Yao
Xiaojiang Zhou
Dan Ou
Haihong Tang
LRM
32
0
0
09 Oct 2025
CALM Before the STORM: Unlocking Native Reasoning for Optimization Modeling
Zhengyang Tang
Zihan Ye
Chenyu Huang
Xuhan Huang
Chengpeng Li
...
Ming Yan
Zizhuo Wang
Hongyuan Zha
Dayiheng Liu
Benyou Wang
LRM
41
0
0
05 Oct 2025
Selective Expert Guidance for Effective and Diverse Exploration in Reinforcement Learning of LLMs
Zishang Jiang
Jinyi Han
Tingyun Li
X. Wang
Sihang Jiang
Jiaqing Liang
Zhaoqian Dai
Shuguang Ma
Fei Yu
Yanghua Xiao
61
0
0
05 Oct 2025
More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration
Xiaoyang Yuan
Yujuan Ding
Yi Bin
Wenqi Shao
Jinyu Cai
Jingkuan Song
Yang Yang
H. Shen
LRM
95
0
1
02 Oct 2025
ExGRPO: Learning to Reason from Experience
Runzhe Zhan
Yafu Li
Zhi Wang
Xiaoye Qu
Dongrui Liu
Jing Shao
Derek F. Wong
Yu Cheng
OffRL
LRM
61
0
1
02 Oct 2025
UniAPL: A Unified Adversarial Preference Learning Framework for Instruct-Following
FaQiang Qian
WeiKun Zhang
Ziliang Wang
Kang An
Xuhui Zheng
Liangjian Wen
Mengya Gao
Yong Dai
Yichao Wu
20
0
0
29 Sep 2025
Group-Relative REINFORCE Is Secretly an Off-Policy Algorithm: Demystifying Some Myths About GRPO and Its Friends
Chaorui Yao
Yanxi Chen
Yuchang Sun
Yushuo Chen
Wenhao Zhang
Xuchen Pan
Yaliang Li
Bolin Ding
OffRL
28
2
0
29 Sep 2025
Scaling Generalist Data-Analytic Agents
Shuofei Qiao
Yanqiu Zhao
Zhisong Qiu
Xiaobin Wang
Jintian Zhang
...
Ningyu Zhang
Yong Jiang
Pengjun Xie
Fei Huang
Huajun Chen
28
0
0
29 Sep 2025
Dynamic-TreeRPO: Breaking the Independent Trajectory Bottleneck with Structured Sampling
Xiaolong Fu
Lichen Ma
Zipeng Guo
Gaojing Zhou
Chongxiao Wang
...
Tan Lit Sin
Yu Shi
Zhen Chen
Junshi Huang
Jason Li
47
0
0
27 Sep 2025
Learn the Ropes, Then Trust the Wins: Self-imitation with Progressive Exploration for Agentic Reinforcement Learning
Yulei Qin
Xiaoyu Tan
Zhengbao He
Gang Li
Haojia Lin
...
Yuzheng Cai
Xuan Zhang
Sheng Ye
Ke Li
Xing Sun
113
0
0
26 Sep 2025
OraPO: Oracle-educated Reinforcement Learning for Data-efficient and Factual Radiology Report Generation
Zhuoxiao Chen
Hongyang Yu
Ying Xu
Yadan Luo
Long Duong
Yuan-Fang Li
OffRL
MedIm
68
0
0
23 Sep 2025
Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Siyan Zhao
Mengchen Liu
Jing Huang
M. Liu
Chenyu Wang
...
Yuandong Tian
Guan Pang
Sean Bell
Aditya Grover
Feiyu Chen
AI4CE
22
1
0
12 Sep 2025
Staying in the Sweet Spot: Responsive Reasoning Evolution via Capability-Adaptive Hint Scaffolding
Ziheng Li
Guoqing Liu
Jinman Zhao
Erxue Min
Yongcheng Zeng
...
Hengyi Cai
Shuaiqiang Wang
D. Yin
Xu Chen
Zhi-Hong Deng
LRM
36
1
0
08 Sep 2025
Beyond Two-Stage Training: Cooperative SFT and RL for LLM Reasoning
Liang Chen
Xueting Han
Li Shen
Jing Bai
Kam-Fai Wong
OffRL
ReLM
LRM
80
7
0
08 Sep 2025
Towards a Unified View of Large Language Model Post-Training
Xingtai Lv
Yuxin Zuo
Youbang Sun
Hongyi Liu
Yuntian Wei
...
Xuekai Zhu
Kaiyan Zhang
Bingning Wang
Ning Ding
Bowen Zhou
OffRL
40
6
0
04 Sep 2025