Test-Time Scaling with Reflective Generative Model

2 July 2025

Zixiao Wang

Yuxin Wang

Xiaorui Wang

Mengting Xing

Jie Gao

Jianjun Xu

Guangcan Liu

Chenhui Jin

Zhuo Wang

Shengzhuo Zhang

Hongtao Xie

Main:13 Pages

8 Figures

Bibliography:4 Pages

4 Tables

Abstract

We introduce our first reflective generative model, MetaStone-S1, which achieves performance comparable to OpenAI o3-mini via a new Reflective Generative Form. The form focuses on selecting high-quality reasoning trajectories and contains two novelties: 1) A unified interface for the policy and process reward models: we share the backbone network and use task-specific heads for predicting and scoring reasoning trajectories, respectively, introducing only 53M extra parameters for trajectory scoring. 2) Eliminating the reliance on process-level annotation: we provide a self-supervised process reward model that learns to select high-quality reasoning trajectories directly from the outcome reward. Equipped with the Reflective Generative Form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at this https URL.
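The trajectory selection described in the abstract can be illustrated with a minimal best-of-N sketch. This is not the released MetaStone-S1 code: the names `NUM_CANDIDATES`, `trajectory_score`, `select_best`, and `best_of_n` are hypothetical, the candidate counts per effort mode are placeholder values, and aggregating per-step scores with a geometric mean is an assumption made for illustration. In the real model, the step scores would come from the scoring head on the shared backbone; here they are supplied directly.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A candidate is (final answer, per-step scores in (0, 1]).
Candidate = Tuple[str, Sequence[float]]

# Hypothetical mapping from reasoning effort mode to number of sampled
# trajectories; the actual values used by the authors may differ.
NUM_CANDIDATES: Dict[str, int] = {"low": 2, "medium": 8, "high": 32}


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores into one trajectory score.

    Assumption: geometric mean, so a single weak step lowers the
    whole trajectory's score. Empty trajectories score 0.0.
    """
    if not step_scores:
        return 0.0
    # Clamp to avoid log(0) on degenerate scores.
    log_sum = sum(math.log(max(s, 1e-12)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_best(candidates: List[Candidate]) -> str:
    """Return the answer of the highest-scoring trajectory."""
    best_answer, _ = max(candidates, key=lambda c: trajectory_score(c[1]))
    return best_answer


def best_of_n(sample_fn: Callable[[], Candidate], effort: str = "medium") -> str:
    """Sample N trajectories (N set by the effort mode) and keep the best."""
    n = NUM_CANDIDATES[effort]
    candidates = [sample_fn() for _ in range(n)]
    return select_best(candidates)
```

Under these assumptions, a trajectory with one low-scoring step (for example, step scores 0.9, 0.2, 0.9) loses to a uniformly moderate one (0.7, 0.7, 0.7), which is the behavior one would want from a process-level scorer. Raising the effort mode only changes how many trajectories are sampled before selection.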
