TSPO: Breaking the Double Homogenization Dilemma in Multi-turn Search Policy Optimization

30 January 2026

Shichao Ma

Zhiyuan Ma

Ming Yang

Xiaofan Li

Xing Wu

Jintao Du

Yu Cheng

Weiqiang Wang

Qiliang Liu

Zhengyang Zhou

Yang Wang

Main:8 Pages

7 Figures

Bibliography:3 Pages

7 Tables

Appendix:5 Pages

Abstract

Multi-turn tool-integrated reasoning enables Large Language Models (LLMs) to solve complex tasks through iterative information retrieval. However, current reinforcement learning (RL) frameworks for search-augmented reasoning rely predominantly on sparse outcome-level rewards, leading to a "Double Homogenization Dilemma." This manifests in two ways: (1) process homogenization, where the thinking, reasoning, and tool use involved in generation are ignored; and (2) intra-group homogenization, where coarse-grained outcome rewards make intra-group advantage estimation inefficient for rollouts sampled under methods such as Group Relative Policy Optimization (GRPO). To address this, we propose Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces the First-Occurrence Latent Reward (FOLR) mechanism, which allocates a partial reward to the step where the ground-truth answer first appears. This preserves process-level signals and increases reward variance within groups, without requiring external reward models or any annotations. Extensive experiments demonstrate that TSPO significantly outperforms state-of-the-art baselines, achieving average performance gains of 24% and 13.6% on Qwen2.5-3B and 7B models, respectively.
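The intra-group homogenization problem and the idea of a first-occurrence reward can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not give the FOLR formula, so the substring match, the `partial=0.5` value, and the function names `group_relative_advantages`, `first_occurrence_turn`, and `turn_rewards` are all assumptions made for illustration. The advantage function uses the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation).

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def first_occurrence_turn(turn_texts, answer):
    """Index of the first turn whose text contains the answer, else None.

    Case-insensitive substring matching is an assumption of this sketch.
    """
    needle = answer.strip().lower()
    for i, text in enumerate(turn_texts):
        if needle in text.lower():
            return i
    return None


def turn_rewards(turn_texts, final_answer, answer, partial=0.5):
    """Hypothetical per-turn rewards: `partial` at the first turn where the
    answer appears, plus 1.0 on the last turn if the final answer is correct.
    """
    rewards = [0.0] * len(turn_texts)
    hit = first_occurrence_turn(turn_texts, answer)
    if hit is not None:
        rewards[hit] += partial
    if rewards and final_answer.strip().lower() == answer.strip().lower():
        rewards[-1] += 1.0
    return rewards


# A group of four rollouts that all end with a wrong final answer.
# With outcome-only rewards, every rollout scores 0 and no advantage signal remains.
outcome_only = [0.0, 0.0, 0.0, 0.0]
flat_advantages = group_relative_advantages(outcome_only)

# Suppose two of those rollouts did retrieve the answer in some turn.
rollouts = [
    (["search: capital of France", "doc: Paris is the capital"], "Lyon"),
    (["search: France cities", "doc: Lyon is a large city"], "Lyon"),
    (["doc: the capital, Paris, lies on the Seine"], "Marseille"),
    (["search: French geography"], "Nice"),
]
shaped = [sum(turn_rewards(turns, final, "Paris")) for turns, final in rollouts]
shaped_advantages = group_relative_advantages(shaped)
```

In this toy group, the outcome-only rewards are identical, so every advantage is zero and the group contributes no gradient signal. The shaped rewards distinguish rollouts that surfaced the answer from those that did not, which is the kind of within-group variance the abstract describes.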
