
AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards

Zihan Lin
Xiaohan Wang
Hexiong Yang
Jiajun Chai
Jie Cao
Guojun Yin
Wei Lin
Ran He
Main: 3 Pages
5 Figures
7 Tables
Appendix: 22 Pages
Abstract

While Reinforcement Learning (RL) shows promise in training tool-use Large Language Models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of reasoning rewards based on chain-of-thought quality for better tool utilization. Furthermore, naïvely combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose Advantage-Weighted Policy Optimization (AWPO), a principled RL framework that adaptively integrates reasoning rewards into advantage estimation to improve tool-use performance. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
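To make the abstract's mechanism concrete, the sketch below shows one plausible reading of how group-relative outcome advantages could be blended with reasoning-reward advantages via variance-aware gating and difficulty-aware weighting, plus clipping. This is a hypothetical illustration, not the paper's implementation: the function name, the thresholds, and the exact gating and weighting formulas are all assumptions.

```python
import numpy as np

def awpo_advantages(outcome_rewards, reasoning_rewards,
                    gate_var_threshold=0.5, clip_range=2.0):
    """Hypothetical AWPO-style advantage shaping for one rollout group
    (all sampled responses to the same prompt). Not the paper's code."""
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_rsn = np.asarray(reasoning_rewards, dtype=float)

    # Group-relative (GRPO-style) normalized advantages from outcome rewards.
    adv_out = (r_out - r_out.mean()) / (r_out.std() + 1e-8)
    adv_rsn = (r_rsn - r_rsn.mean()) / (r_rsn.std() + 1e-8)

    # Variance-aware gating (assumed form): inject the reasoning signal only
    # when outcome rewards barely discriminate between responses.
    gate = 1.0 if r_out.std() < gate_var_threshold else 0.0

    # Difficulty-aware weighting (assumed form): harder prompts, i.e. low
    # mean success, lean more on the reasoning reward (rewards in [0, 1]).
    difficulty = 1.0 - r_out.mean()
    weight = gate * difficulty

    # Tailored clipping of the blended advantage for stable optimization.
    return np.clip(adv_out + weight * adv_rsn, -clip_range, clip_range)
```

Under this sketch, a group where every rollout fails (zero outcome variance) falls back on reasoning quality to rank responses, while a group with clearly separated outcomes ignores the reasoning signal entirely.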
