mR3: Multilingual Rubric-Agnostic Reward Reasoning Models
- LRM
Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including support for reasoning in the target language. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Finally, we demonstrate the effectiveness of mR3 in off-policy preference optimization and validate the quality of its reasoning traces and rubric-based evaluations through human studies with 20 annotators across 12 languages, where mR3 models' reasoning is preferred, including for extremely low-resource languages that are entirely unseen during training. Our models, data, and code are available as open source atthis https URL.
View on arXiv