Learning to chain-of-thought with Jensen's evidence lower bound

25 March 2025
Yunhao Tang, Sid Wang, Rémi Munos
Abstract

We propose a way to optimize chain-of-thought with reinforcement learning, but without an external reward function. Our algorithm treats the chain-of-thought as a latent variable in a probabilistic inference problem. In contrast to the full evidence lower bound, we propose to apply a much simpler Jensen's lower bound, which yields tractable objectives with simple algorithmic components (e.g., without the need for a parametric approximate posterior), making it more conducive to modern large-scale training. The lower-bound approach naturally interpolates between other methods such as supervised fine-tuning and online reinforcement learning, whose practical trade-offs we illustrate. Finally, we show that on mathematical reasoning problems, optimizing with Jensen's lower bound is as effective as policy gradient with an external reward. Taken together, our results serve as a proof of concept of this new algorithmic paradigm's potential for more general applications.
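The key step is a single application of Jensen's inequality. Writing x for the prompt, z for the chain-of-thought, and y for the final answer, the bound below (a reconstruction from the abstract's description, in our own notation) replaces the intractable marginal log-likelihood with an expectation that can be estimated by sampling chains-of-thought from the model itself, with no parametric approximate posterior:

\[
\log p_\theta(y \mid x)
  = \log \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[\, p_\theta(y \mid x, z) \,\big]
  \;\ge\; \mathbb{E}_{z \sim \pi_\theta(\cdot \mid x)}\big[\, \log p_\theta(y \mid x, z) \,\big].
\]

Differentiating the right-hand side with the score-function (REINFORCE) identity yields a policy-gradient-like update in which log p_theta(y | x, z) plays the role of the reward, which is why no external reward function is needed. The following is a minimal, hypothetical PyTorch sketch of that surrogate loss; the function name, tensor layout, and the batch-mean baseline are our assumptions for illustration, not details taken from the paper.

import torch

def jensen_bound_loss(logp_cot: torch.Tensor, logp_answer: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient matches the score-function estimator
    of Jensen's lower bound  E_{z ~ pi_theta}[ log p_theta(y | x, z) ].

    logp_cot:    (B,) summed log-probs of each sampled chain-of-thought z
                 under pi_theta(. | x), differentiable w.r.t. theta.
    logp_answer: (B,) summed log-probs of the target answer y given (x, z),
                 differentiable w.r.t. theta.
    """
    # log p_theta(y | x, z) acts as the per-sample "reward" for the sampled CoT.
    reward = logp_answer.detach()
    # Batch-mean baseline for variance reduction (our assumption, not from the
    # paper; it leaves the estimator unbiased since E[grad log pi] = 0).
    advantage = reward - reward.mean()
    # Score-function term on the CoT plus the pathwise term on the answer.
    return -(advantage * logp_cot + logp_answer).mean()

Minimizing this loss ascends the Jensen bound: the first term reinforces chains-of-thought that make the observed answer likely, and the second is ordinary likelihood training of the answer given the sampled chain, which hints at the interpolation between online reinforcement learning and supervised fine-tuning mentioned in the abstract.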

@article{tang2025_2503.19618,
  title={Learning to chain-of-thought with Jensen's evidence lower bound},
  author={Yunhao Tang and Sid Wang and Rémi Munos},
  journal={arXiv preprint arXiv:2503.19618},
  year={2025}
}