This paper presents the Entropy-Driven Unified Process Reward Model (EDU-PRM), a novel framework that approximates state-of-the-art performance in process supervision while drastically reducing training costs. EDU-PRM introduces an entropy-guided dynamic step-partitioning mechanism that uses the entropy of the logit distribution to dynamically pinpoint high-uncertainty regions during token generation. This self-assessment capability enables precise step-level feedback without manual fine-grained annotation, addressing a critical challenge in process supervision. Experiments on the Qwen2.5-72B model with only 7,500 EDU-PRM-generated training queries achieve accuracy closely approximating that of the full Qwen2.5-72B-PRM (71.1% vs. 71.6%), a 98% reduction in query cost compared to prior methods. This work establishes EDU-PRM as an efficient approach to scalable process reward model training.
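The core mechanism described above, placing step boundaries where the next-token distribution is most uncertain, can be sketched in a few lines. The following is a minimal illustration, assuming a simple fixed-threshold rule over per-token Shannon entropy; the function names, the threshold value, and the splitting convention are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def token_entropies(logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of the next-token distribution at each position.

    logits: [seq_len, vocab_size] tensor of pre-softmax scores.
    Returns a [seq_len] tensor of entropies in nats.
    """
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    return -(probs * log_probs).sum(dim=-1)

def entropy_step_boundaries(logits: torch.Tensor, threshold: float = 2.0) -> list[int]:
    """Mark positions whose predictive entropy exceeds a threshold as
    candidate step boundaries (high-uncertainty regions).

    The threshold is a hypothetical hyperparameter, not a value from the paper.
    """
    entropies = token_entropies(logits)
    return (entropies > threshold).nonzero(as_tuple=True)[0].tolist()

def split_into_steps(token_ids: list[int], boundaries: list[int]) -> list[list[int]]:
    """Partition a generated token sequence into steps, cutting after each boundary."""
    steps, start = [], 0
    for b in boundaries:
        steps.append(token_ids[start : b + 1])
        start = b + 1
    if start < len(token_ids):
        steps.append(token_ids[start:])
    return steps
```

In a setup like this, the threshold (or an equivalent top-k rule over entropies) would govern how fine-grained the resulting steps are, which is what lets the partitioning adapt to the model's own uncertainty rather than relying on manual step annotation.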
@article{cao2025_2503.22233,
  title={Process Reward Modeling with Entropy-Driven Uncertainty},
  author={Lang Cao and Renhong Chen and Yingtian Zou and Chao Peng and Wu Ning and Huacong Xu and Qian Chen and Yuxian Wang and Peishuo Su and Mofan Peng and Zijie Chen and Yitong Li},
  journal={arXiv preprint arXiv:2503.22233},
  year={2025}
}