
EvoCoT: Overcoming the Exploration Bottleneck in Reinforcement Learning

Main: 12 pages · 4 figures · 5 tables · Bibliography: 1 page · Appendix: 2 pages
Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a promising paradigm for post-training large language models (LLMs) to improve their reasoning capability. However, when rollout accuracy is low on hard problems, rewards become sparse, limiting learning efficiency and creating an exploration bottleneck. Existing approaches either rely on stronger LLMs for distillation, which limits scalability, or filter out difficult problems, which restricts the reasoning gains attainable through exploration.
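The sparse-reward phenomenon the abstract describes can be illustrated with a minimal sketch. All names here and the simulation itself are hypothetical, not the paper's code: a verifiable reward is binary (answer matches a reference or not), so on hard problems with low rollout accuracy, almost every sampled rollout yields zero reward and thus little learning signal.

```python
import random

def verifiable_reward(answer: str, reference: str) -> float:
    # RLVR-style binary reward: 1.0 if the model's final answer
    # matches the verifiable reference exactly, else 0.0.
    return 1.0 if answer.strip() == reference.strip() else 0.0

def rollout_rewards(rollout_accuracy: float, n_rollouts: int, seed: int = 0):
    # Hypothetical simulation: each rollout on one problem is correct
    # with probability rollout_accuracy.
    rng = random.Random(seed)
    return [1.0 if rng.random() < rollout_accuracy else 0.0
            for _ in range(n_rollouts)]

# On a hard problem (low rollout accuracy) nearly all rewards are zero,
# so the policy receives a sparse gradient signal; on an easy problem
# the reward is dense.
hard = rollout_rewards(rollout_accuracy=0.02, n_rollouts=64)
easy = rollout_rewards(rollout_accuracy=0.80, n_rollouts=64)
print(sum(hard), sum(easy))
```

The gap between the two sums is the exploration bottleneck in miniature: with almost no positive rewards, policy-gradient updates on hard problems carry little information.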
