DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models

14 May 2025
Xiwen Chen
Wenhui Zhu
Peijie Qiu
Xuanzhao Dong
Hao Wang
Haiyu Wu
Huayu Li
Aristeidis Sotiras
Yalin Wang
Abolfazl Razi
Abstract

Recent advances in reinforcement learning for language model post-training, such as Group Relative Policy Optimization (GRPO), have shown promise in low-resource settings. However, GRPO typically relies on solution-level and scalar reward signals that fail to capture the semantic diversity among sampled completions. This leads to what we identify as a diversity-quality inconsistency, where distinct reasoning paths may receive indistinguishable rewards. To address this limitation, we propose Diversity-aware Reward Adjustment (DRA), a method that explicitly incorporates semantic diversity into the reward computation. DRA uses Submodular Mutual Information (SMI) to downweight redundant completions and amplify rewards for diverse ones. This encourages better exploration during learning, while maintaining stable exploitation of high-quality samples. Our method integrates seamlessly with both GRPO and its variant DR. GRPO, resulting in DRA-GRPO and DGA-DR. GRPO. We evaluate our method on five mathematical reasoning benchmarks and find that it outperforms recent strong baselines. It achieves state-of-the-art performance with an average accuracy of 58.2%, using only 7,000 fine-tuning samples and a total training cost of approximately $55. The code is available at this https URL.
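
To make the reward-adjustment idea concrete, the following is a minimal, illustrative sketch rather than the paper's implementation: it replaces the Submodular Mutual Information term with a simple cosine-similarity redundancy score over completion embeddings, then feeds the adjusted rewards into a GRPO-style group-relative advantage. The function names, the alpha parameter, and the source of the embeddings are assumptions made purely for illustration.

import numpy as np

def diversity_adjusted_rewards(rewards, embeddings, alpha=1.0):
    # Illustrative stand-in for DRA: scale each completion's scalar reward
    # by how semantically distinct it is within its sampled group.
    # (The paper uses Submodular Mutual Information; here a mean cosine
    # similarity serves as a crude redundancy proxy.)
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = X @ X.T                      # pairwise cosine similarity, shape (G, G)
    np.fill_diagonal(sim, 0.0)
    redundancy = sim.mean(axis=1)      # high = close to many other completions
    diversity = 1.0 - redundancy       # high = semantically distinct
    weights = 1.0 + alpha * (diversity - diversity.mean())
    return np.asarray(rewards, dtype=float) * weights

def group_relative_advantages(adjusted_rewards, eps=1e-8):
    # GRPO-style advantage: normalize rewards within the group of
    # completions sampled for the same prompt.
    r = np.asarray(adjusted_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Toy example: four completions, two of them near-duplicates.
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
emb[1] = emb[0] + 0.01 * rng.normal(size=8)   # near-duplicate of completion 0
r = [1.0, 1.0, 1.0, 0.0]                      # solution-level scalar rewards
adv = group_relative_advantages(diversity_adjusted_rewards(r, emb, alpha=0.5))
print(adv)

In the toy example, plain group normalization would give the three reward-1 completions identical advantages, which is the diversity-quality inconsistency described above; with the diversity weighting, the two near-duplicates receive slightly smaller advantages than the distinct high-reward completion.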

@article{chen2025_2505.09655,
  title={DRA-GRPO: Exploring Diversity-Aware Reward Adjustment for R1-Zero-Like Training of Large Language Models},
  author={Xiwen Chen and Wenhui Zhu and Peijie Qiu and Xuanzhao Dong and Hao Wang and Haiyu Wu and Huayu Li and Aristeidis Sotiras and Yalin Wang and Abolfazl Razi},
  journal={arXiv preprint arXiv:2505.09655},
  year={2025}
}