Test-Time Scaling with Reflective Generative Model

2 July 2025

Zixiao Wang

Yuxin Wang

Xiaorui Wang

Mengting Xing

Jie Gao

Jianjun Xu

Guangcan Liu

Chenhui Jin

Zhuo Wang

Shengzhuo Zhang

Hongtao Xie

LRM

Main:13 Pages

8 Figures

Bibliography:4 Pages

4 Tables

Abstract

We introduce our first reflective generative model, MetaStone-S1, which obtains OpenAI o3-mini's performance via the self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high) based on the controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at this https URL.
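The shared-backbone idea in the abstract can be illustrated with a toy sketch. It is not the MetaStone-S1 implementation: the class `SharedBackboneModel`, the helpers `trajectory_score` and `best_of_n`, the tiny random weights, and the geometric-mean aggregation of step scores are all assumptions made for illustration. It shows only that one backbone can feed both a next-token head and a much smaller process-scoring head, and that the scoring head can rank candidate reasoning trajectories at test time.

```python
import math
import random


def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(w, x))


class SharedBackboneModel:
    """Toy model (hypothetical): one shared backbone, two task-specific heads."""

    def __init__(self, hidden=8, vocab=5, seed=0):
        rng = random.Random(seed)
        # Shared backbone: a single hidden x hidden layer in this sketch.
        self.backbone = [[rng.uniform(-1, 1) for _ in range(hidden)]
                         for _ in range(hidden)]
        # Policy head: maps hidden features to next-token logits.
        self.lm_head = [[rng.uniform(-1, 1) for _ in range(hidden)]
                        for _ in range(vocab)]
        # Process-scoring head: a single vector, far smaller than the backbone.
        self.score_head = [rng.uniform(-1, 1) for _ in range(hidden)]

    def features(self, x):
        """Backbone features, computed once and usable by both heads."""
        return [math.tanh(dot(row, x)) for row in self.backbone]

    def next_token_logits(self, x):
        h = self.features(x)
        return [dot(row, h) for row in self.lm_head]

    def step_score(self, x):
        """Score in (0, 1) for one reasoning step, via a sigmoid."""
        h = self.features(x)
        return 1.0 / (1.0 + math.exp(-dot(self.score_head, h)))


def trajectory_score(model, step_states):
    """Aggregate step scores with a geometric mean (an assumed choice)."""
    scores = [model.step_score(s) for s in step_states]
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


def best_of_n(model, candidates):
    """Return the index of the candidate trajectory with the highest score."""
    return max(range(len(candidates)),
               key=lambda i: trajectory_score(model, candidates[i]))
```

In this sketch, a larger number of candidates passed to `best_of_n` corresponds to a higher reasoning effort: more trajectories are sampled and scored, and the highest-scoring one is kept. Each candidate here is simply a list of per-step input vectors standing in for the hidden states of real reasoning steps.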
