
Test-Time Scaling with Reflective Generative Model

Zixiao Wang
Yuxin Wang
Xiaorui Wang
Mengting Xing
Jie Gao
Jianjun Xu
Guangcan Liu
Chenhui Jin
Zhuo Wang
Shengzhuo Zhang
Hongtao Xie
Main: 13 pages, 8 figures, 4 tables
Bibliography: 4 pages
Abstract

We introduce MetaStone-S1, our first reflective generative model, which achieves performance comparable to OpenAI o3-mini through a new Reflective Generative Form. The new form focuses on selecting high-quality reasoning trajectories and contains two novelties: 1) A unified interface for the policy and process reward models: we share the backbone network and use task-specific heads for predicting and scoring reasoning trajectories, respectively, introducing only 53M extra parameters for trajectory scoring. 2) No reliance on process-level annotation: we provide a self-supervised process reward model that learns to select high-quality reasoning trajectories directly from the outcome reward. Equipped with the Reflective Generative Form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at this https URL.
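The trajectory selection described in the abstract can be illustrated with a minimal sketch. It is not the released MetaStone-S1 code: the candidate counts per effort mode, the function names, and the geometric-mean aggregation of step scores are all assumptions made here for illustration, and the policy and scoring heads are replaced by a caller-supplied sampler that returns an answer together with per-step scores.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# One candidate = (final answer, per-step scores in (0, 1] from a scoring head).
Candidate = Tuple[str, List[float]]

# Hypothetical mapping from effort mode to number of sampled trajectories;
# the actual values used by MetaStone-S1 are not stated in the abstract.
EFFORT_TO_CANDIDATES: Dict[str, int] = {"low": 2, "medium": 4, "high": 8}


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores into one trajectory score.

    Assumption: a geometric mean, so a single weak step lowers the whole
    trajectory. An empty trajectory scores 0.0.
    """
    if not step_scores:
        return 0.0
    # Clamp to avoid log(0) for degenerate scores.
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(candidates: Sequence[Candidate]) -> Tuple[str, float]:
    """Return the answer of the highest-scoring trajectory and its score."""
    scored = [(answer, trajectory_score(steps)) for answer, steps in candidates]
    return max(scored, key=lambda pair: pair[1])


def reflective_generate(sample: Callable[[], Candidate], effort: str) -> Tuple[str, float]:
    """Sample several trajectories (more at higher effort) and keep the best."""
    n = EFFORT_TO_CANDIDATES[effort]
    return select_best([sample() for _ in range(n)])


if __name__ == "__main__":
    fixed = [
        ("A", [0.9, 0.9]),    # consistently good steps
        ("B", [0.99, 0.1]),   # one very weak step
        ("C", [0.5, 0.5]),    # mediocre throughout
    ]
    print(select_best(fixed)[0])  # "A" has the highest geometric mean
```

In the model itself, both the trajectory and its step scores would come from the shared backbone through its two heads; the sketch only shows the selection step that makes more sampled candidates translate into a better final answer.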
