Learning Explainable Dense Reward Shapes via Bayesian Optimization

22 April 2025
Ryan Koo
Ian Yang
Vipul Raheja
Mingyi Hong
Kwang-Sung Jun
Dongyeop Kang
Abstract

Current reinforcement learning from human feedback (RLHF) pipelines for large language model (LLM) alignment typically assign scalar rewards to sequences, using the final token as a surrogate indicator for the quality of the entire sequence. However, this leads to sparse feedback and suboptimal token-level credit assignment. In this work, we frame reward shaping as an optimization problem focused on token-level credit assignment. We propose a reward-shaping function that leverages explainability methods such as SHAP and LIME to estimate per-token rewards from the reward model. To learn the parameters of this shaping function, we employ a bilevel optimization framework that integrates Bayesian optimization and policy training to handle noise in the token-level reward estimates. Our experiments show that a better balance of token-level reward attribution yields performance improvements over baselines on downstream tasks and reaches an optimal policy faster during training. Furthermore, we show theoretically that explainability methods that are feature-additive attribution functions preserve the same optimal policy as the original reward.
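
The dense-reward idea in the abstract can be pictured with a minimal sketch: a sequence-level scalar reward is redistributed across tokens using a feature-additive attribution. For brevity, a leave-one-out occlusion score (normalized so the per-token rewards sum back to the scalar reward) stands in for full SHAP or LIME values; `reward_model`, `tokens`, and `mask_token` are hypothetical placeholders, not the paper's implementation.

```python
# Minimal sketch: redistributing a sequence-level reward over tokens using a
# feature-additive attribution. A leave-one-out occlusion score stands in for
# SHAP/LIME values; `reward_model` is a hypothetical callable mapping a token
# list to a scalar reward.
from typing import Callable, List
import numpy as np

def dense_token_rewards(
    tokens: List[str],
    reward_model: Callable[[List[str]], float],
    mask_token: str = "<unk>",
) -> np.ndarray:
    """Estimate per-token rewards whose sum equals the sequence-level reward."""
    r_full = reward_model(tokens)

    # Occlusion-style attribution: reward drop when each token is masked out.
    raw = np.array([
        r_full - reward_model(tokens[:i] + [mask_token] + tokens[i + 1:])
        for i in range(len(tokens))
    ])

    # Normalize so the dense rewards sum back to the original scalar reward,
    # keeping the return of every trajectory (and hence the optimal policy)
    # unchanged, mirroring the feature-additivity property used in the paper.
    total = raw.sum()
    if np.isclose(total, 0.0):
        return np.full(len(tokens), r_full / len(tokens))
    return raw * (r_full / total)
```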

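The outer level of the bilevel framework can be sketched as a Bayesian-optimization loop over the shaping function's parameters. This sketch assumes scikit-optimize's `gp_minimize`; `train_policy_and_eval` is a hypothetical stand-in for the inner policy-training-and-evaluation step (a noisy toy objective here so the snippet runs), and the two weights are illustrative, not the paper's actual parameterization.

```python
# Sketch of the outer Bayesian-optimization loop, assuming scikit-optimize.
from skopt import gp_minimize
from skopt.space import Real
import numpy as np

def train_policy_and_eval(w_attrib: float, w_sequence: float) -> float:
    """Hypothetical inner loop: train a policy with shaped rewards
    r_t = w_attrib * attribution_t + w_sequence * terminal_reward_t,
    then return a (noisy) validation score. A toy quadratic stands in here."""
    rng = np.random.default_rng(0)
    return -(w_attrib - 0.7) ** 2 - (w_sequence - 0.3) ** 2 + rng.normal(0, 0.01)

def objective(params):
    w_attrib, w_sequence = params
    # gp_minimize minimizes, so negate the validation score we want to maximize.
    return -train_policy_and_eval(w_attrib, w_sequence)

search_space = [
    Real(0.0, 1.0, name="w_attrib"),    # weight on per-token attributions
    Real(0.0, 1.0, name="w_sequence"),  # weight on the original scalar reward
]

result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print("best shaping weights:", result.x, "best score:", -result.fun)
```

Because the Gaussian-process surrogate in the outer loop averages over noisy inner-loop scores, this kind of setup is one way to tolerate noise in the token-level reward estimates, as the abstract describes.
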
View on arXiv
@article{koo2025_2504.16272,
  title={Learning Explainable Dense Reward Shapes via Bayesian Optimization},
  author={Ryan Koo and Ian Yang and Vipul Raheja and Mingyi Hong and Kwang-Sung Jun and Dongyeop Kang},
  journal={arXiv preprint arXiv:2504.16272},
  year={2025}
}