EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

Main: 12 pages · Bibliography: 1 page · Appendix: 2 pages · 4 figures · 5 tables
Abstract
Reinforcement learning with verifiable reward (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when rollout accuracy is low on hard problems, the reward becomes sparse, which limits learning efficiency and creates an exploration bottleneck. Existing approaches either rely on stronger LLMs for distillation, which limits scalability, or filter out difficult problems, which restricts the model's ability to improve its reasoning through exploration.
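To make the sparsity concrete, below is a minimal Python sketch (not from the paper; `verifiable_reward`, `sample_rollout`, and the group size of 8 are hypothetical stand-ins for an RLVR training loop). With a binary verifiable reward, a problem the policy solves only rarely yields entire rollout groups with zero reward, leaving no learning signal for that problem.

```python
import random

def verifiable_reward(answer: str, reference: str) -> float:
    """Binary verifiable reward: 1 if the final answer matches the reference, else 0."""
    return 1.0 if answer.strip() == reference.strip() else 0.0

def sample_rollout(p_correct: float, reference: str) -> str:
    # Stand-in for sampling one reasoning rollout from the policy; it
    # reaches the reference answer with probability `p_correct`.
    return reference if random.random() < p_correct else "wrong"

random.seed(0)
reference = "42"
for name, p_correct in [("hard problem (2% accuracy)", 0.02),
                        ("easy problem (50% accuracy)", 0.5)]:
    zero_reward_groups = 0
    for _ in range(1000):
        # A group of 8 rollouts per problem, as in group-based RLVR training.
        rewards = [verifiable_reward(sample_rollout(p_correct, reference), reference)
                   for _ in range(8)]
        if sum(rewards) == 0.0:
            zero_reward_groups += 1
    print(f"{name}: {zero_reward_groups / 10:.1f}% of groups have zero reward")
```

Under these assumptions, roughly 85% of groups on the hard problem receive no reward at all (0.98^8 ≈ 0.85), versus well under 1% on the easy problem, which is the exploration bottleneck the abstract describes.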
