
Motif-2-12.7B-Reasoning: A Practitioner's Guide to RL Training Recipes

Junghwan Lim
Sungmin Lee
Dongseok Kim
Taehyun Kim
Eunhwan Park
Jeesoo Lee
Jeongdoo Lee
Junhyeok Lee
Wai Ting Cheung
Dahye Choi
Minsu Ha
Jaeheui Her
Jaeyeon Huh
Hanbin Jung
Changjin Kang
Beomgyu Kim
Minjae Kim
Taewhan Kim
Youngrok Kim
Hyukjin Kweon
Haesol Lee
Kungyu Lee
Dongpin Oh
Yeongjae Park
Bokki Ryu
Dongjoo Weon
Main: 7 pages · 4 figures · Bibliography: 2 pages · Appendix: 1 page
Abstract

We introduce Motif-2-12.7B-Reasoning, a 12.7B-parameter language model designed to bridge the gap between open-weight systems and proprietary frontier models in complex reasoning and long-context understanding. Addressing the common challenges of model collapse and training instability in reasoning adaptation, we propose a comprehensive, reproducible training recipe spanning system, data, and algorithmic optimizations. Our approach combines memory-efficient infrastructure for 64K-token contexts, built on hybrid parallelism and kernel-level optimizations, with a two-stage Supervised Fine-Tuning (SFT) curriculum that mitigates distribution mismatch through verified, aligned synthetic data. Furthermore, we detail a robust Reinforcement Learning Fine-Tuning (RLFT) pipeline that stabilizes training via difficulty-aware data filtering and mixed-policy trajectory reuse. Empirical results demonstrate that Motif-2-12.7B-Reasoning achieves performance comparable to models with significantly larger parameter counts across mathematics, coding, and agentic benchmarks, offering the community a competitive open model and a practical blueprint for scaling reasoning capabilities under realistic compute constraints.
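The abstract names difficulty-aware data filtering as one of the stabilizers in the RLFT pipeline. As a rough illustration only, the sketch below shows one common way such filtering is realized: keeping prompts whose empirical pass rate under the current policy is neither near 0 nor near 1, since those give little learning signal. The function names, thresholds, and sampling budget here are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, List

def estimate_pass_rate(prompt: str,
                       policy: Callable[[str], str],
                       verifier: Callable[[str, str], bool],
                       n_samples: int = 8) -> float:
    """Sample n_samples rollouts for a prompt and return the fraction the verifier accepts."""
    return sum(verifier(prompt, policy(prompt)) for _ in range(n_samples)) / n_samples

def filter_by_difficulty(prompts: Iterable[str],
                         policy: Callable[[str], str],
                         verifier: Callable[[str, str], bool],
                         low: float = 0.1, high: float = 0.9,
                         n_samples: int = 8) -> List[str]:
    """Keep prompts whose empirical pass rate lies strictly between `low` and `high`.
    Prompts the policy always or never solves contribute little reward signal and
    tend to destabilize RL fine-tuning, so they are dropped before training.
    (Illustrative thresholds; the paper's actual criteria may differ.)"""
    return [p for p in prompts
            if low < estimate_pass_rate(p, policy, verifier, n_samples) < high]
```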
