ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2412.17256
35
7

B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners

23 December 2024
Weihao Zeng
Yuzhen Huang
Lulu Zhao
Yijun Wang
Zifei Shan
Junxian He
    LRM
ArXivPDFHTML
Abstract

In the absence of extensive human-annotated data for complex reasoning tasks, self-improvement -- where models are trained on their own outputs -- has emerged as a primary method for enhancing performance. However, the critical factors underlying the mechanism of these iterative self-improving methods remain poorly understood, such as under what conditions self-improvement is effective, and what are the bottlenecks in the current iterations. In this work, we identify and propose methods to monitor two pivotal factors in this iterative process: (1) the model's ability to generate sufficiently diverse responses (exploration); and (2) the effectiveness of external rewards in distinguishing high-quality candidates from lower-quality ones (exploitation). Using mathematical reasoning as a case study, we begin with a quantitative analysis to track the dynamics of exploration and exploitation, discovering that a model's exploratory capabilities rapidly deteriorate over iterations, and the effectiveness of exploiting external rewards diminishes as well. Motivated by these findings, we introduce B-STaR, a Self-Taught Reasoning framework that autonomously adjusts configurations across iterations to Balance exploration and exploitation, thereby optimizing the self-improving effectiveness based on the current policy model and available rewards. Our experiments on mathematical reasoning, coding, and commonsense reasoning demonstrate that B-STaR not only enhances the model's exploratory capabilities throughout training but also achieves a more effective balance between exploration and exploitation, leading to superior performance.

View on arXiv
@article{zeng2025_2412.17256,
  title={ B-STaR: Monitoring and Balancing Exploration and Exploitation in Self-Taught Reasoners },
  author={ Weihao Zeng and Yuzhen Huang and Lulu Zhao and Yijun Wang and Zifei Shan and Junxian He },
  journal={arXiv preprint arXiv:2412.17256},
  year={ 2025 }
}
Comments on this paper