Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain

16 April 2025
Zhongxi Qiu
Zhang Zhang
Yan Hu
Heng Li
Jiang Liu
Abstract

This paper explores optimal data selection strategies for Reinforcement Learning with Verified Rewards (RLVR) training in the medical domain. While RLVR has shown exceptional potential for enhancing reasoning capabilities in large language models, most prior implementations have focused on mathematics and logical puzzles, with limited exploration of domain-specific applications like medicine. We investigate four distinct data sampling strategies from MedQA-USMLE: random sampling (baseline), and filtering using the Phi-4, Gemma-3-27b-it, and Gemma-3-12b-it models. Using Gemma-3-12b-it as our base model and implementing Group Relative Policy Optimization (GRPO), we evaluate performance across multiple benchmarks, including MMLU, GSM8K, MMLU-Pro, and CMMLU. Our findings demonstrate that models trained on filtered data generally outperform those trained on randomly selected samples. Notably, training on self-filtered samples (using Gemma-3-12b-it for filtering) achieved superior performance in medical domains but showed reduced robustness across different benchmarks, while filtering with larger models from the same series yielded better overall robustness. These results provide valuable insights into effective data organization strategies for RLVR in specialized domains and highlight the importance of thoughtful data selection in achieving optimal performance. The code is available in our repository (this https URL).
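
The abstract does not spell out the filtering criterion, but a common recipe for model-based difficulty filtering is to sample each question several times with the filter model and keep items of intermediate pass rate. The sketch below illustrates that recipe in Python; the pass-rate band, prompt format, sampling parameters, and helper names (pass_rate, extract_choice, filter_dataset) are hypothetical illustrations under that assumption, not the paper's actual procedure.

# A minimal sketch, assuming difficulty-based filtering: sample each
# question k times with the filter model and keep items of intermediate
# pass rate. The band (0.2-0.8) and prompt format are hypothetical;
# the paper's exact criterion is not given in the abstract.
import re
from transformers import pipeline

# Illustrative filter model (the self-filtering variant from the paper);
# exact loading details for Gemma 3 checkpoints may differ.
generator = pipeline("text-generation", model="google/gemma-3-12b-it")

def extract_choice(text):
    # Pull a single answer letter (A-E) out of a completion.
    m = re.search(r"\b([A-E])\b", text)
    return m.group(1) if m else None

def pass_rate(question, answer, k=8):
    # Fraction of k sampled completions that match the gold answer.
    prompt = question + "\nAnswer with a single letter (A-E).\n"
    outputs = generator(
        prompt,
        num_return_sequences=k,
        do_sample=True,
        temperature=1.0,
        max_new_tokens=256,
        return_full_text=False,
    )
    hits = sum(extract_choice(o["generated_text"]) == answer for o in outputs)
    return hits / k

def filter_dataset(examples, low=0.2, high=0.8):
    # Keep MedQA items the filter model solves only sometimes: items it
    # always solves give near-zero advantage within a GRPO group, and
    # items it never solves yield no positive reward to reinforce.
    return [ex for ex in examples
            if low <= pass_rate(ex["question"], ex["answer"]) <= high]

The filtered set would then feed GRPO training of the base model with a verifiable reward (e.g., exact match on the final answer letter), for which off-the-shelf implementations such as TRL's GRPOTrainer exist; this pairing is one plausible setup, not necessarily the paper's exact pipeline.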

@article{qiu2025_2504.13950,
  title={Open-Medical-R1: How to Choose Data for RLVR Training at Medicine Domain},
  author={Zhongxi Qiu and Zhang Zhang and Yan Hu and Heng Li and Jiang Liu},
  journal={arXiv preprint arXiv:2504.13950},
  year={2025}
}