289
v1v2 (latest)

Direct Reasoning Optimization: Constrained RL with Token-Level Dense Reward and Rubric-Gated Constraints for Open-ended Tasks

Main:7 Pages
7 Figures
Bibliography:3 Pages
5 Tables
Appendix:12 Pages
Abstract

RL training of LLMs on open-ended tasks is challenging due to the lack of direct verifiability. In this paper, we frame such training as constrained RL that (i) optimizes a token-level dense Reasoning Reflection Reward (R3) aligned with reasoning quality, and (ii) enforces rubric-gating as feasibility constraints at the rollout group level. R3 measures the model's token-level certainty of a reference answer under its CoT reasoning prefix while selectively emphasizing reasoning-reflective tokens to capture how likely the generated reasoning is to yield the desired answer. Rubric-gating complements R3 by operationalizing principled task criteria as hard accept/reject checks on final answers. Empirically, across four datasets, our framework outperforms baselines, achieves faster, more sample-efficient learning, and respects feasibility constraints.

View on arXiv
Comments on this paper