
Reasoning without Regret

Abstract

Chain-of-thought reasoning enables large language models to solve multi-step tasks by framing problem solving as sequential decision problems. Outcome-based rewards, which provide feedback only on final answers, have shown impressive success but suffer from poor credit assignment and slow convergence. In contrast, procedure-based rewards offer efficient step-level feedback but typically require costly human supervision. We introduce \emph{Backwards Adaptive Reward Shaping} (BARS), a no-regret framework that converts sparse outcome-based rewards into effective procedure-based signals. BARS uses sparse rewards generated from terminal-state priors and cover trees to scale rewards while preventing exploitation. With Bellman contraction and $(\Delta, \epsilon)$-gap rewards, our backward Euler solver achieves $\epsilon$-accuracy in $O\left((R_{\max}/\Delta)\log(1/\epsilon)\right)$ iterations with $O(\log T)$ dynamic regret over $T$ rounds. Our analysis, based on generic chaining, continuous scaling limits, and non-linear Feynman-Kac bounds, connects the empirical success of recent outcome-based methods with the benefits of intermediate supervision. Together, these results provide the first rigorous no-regret algorithm for outcome-based reward shaping and a theoretical foundation for the empirical success of DeepSeek's R1.
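
To make the backward-shaping idea concrete, here is a minimal sketch (not the paper's BARS algorithm) of how a sparse terminal reward on a toy finite MDP can be propagated backward into dense step-level signals via Bellman backups and potential-based shaping. The function name `backward_reward_shaping` and all parameters are illustrative assumptions; the geometric convergence of the contraction mirrors the $O(\log(1/\epsilon))$ iteration count quoted above.

```python
import numpy as np

def backward_reward_shaping(P, terminal_reward, gamma=0.9, eps=1e-6):
    """Propagate a sparse outcome reward backward into step-level potentials.

    P               : (S, A, S) transition probability tensor.
    terminal_reward : (S,) reward that is nonzero only at terminal states.
    Returns the value function V, a shaped step reward F, and the iteration count.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    n_iters = 0
    while True:
        # Bellman backup: a gamma-contraction in the sup norm.
        Q = terminal_reward[:, None] + gamma * (P @ V)   # shape (S, A)
        V_new = Q.max(axis=1)
        n_iters += 1
        if np.max(np.abs(V_new - V)) < eps:              # sup-norm stopping rule
            V = V_new
            break
        V = V_new
    # Potential-based shaping: F(s, a, s') = gamma * V[s'] - V[s] gives dense
    # step-level feedback while leaving the optimal policy unchanged.
    F = gamma * V[None, None, :] - V[:, None, None]
    return V, F, n_iters

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 6, 2
    P = rng.dirichlet(np.ones(S), size=(S, A))           # random row-stochastic kernel
    r = np.zeros(S)
    r[-1] = 1.0                                          # outcome reward only at the last state
    V, F, iters = backward_reward_shaping(P, r)
    print(f"converged in {iters} iterations")            # roughly log(1/eps) / log(1/gamma)
```

Because the backup is a contraction with modulus $\gamma$, the sup-norm error shrinks geometrically, so the iteration count scales as $\log(1/\epsilon)$; BARS sharpens this dependence through the $(\Delta, \epsilon)$-gap structure of the rewards.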

@article{chitra2025_2504.09777,
  title={Reasoning without Regret},
  author={Tarun Chitra},
  journal={arXiv preprint arXiv:2504.09777},
  year={2025}
}