
AWPO: Enhancing Tool-Use of Large Language Models through Adaptive Integration of Reasoning Rewards

Zihan Lin
Xiaohan Wang
Hexiong Yang
Jiajun Chai
Jie Cao
Guojun Yin
Wei Lin
Ran He
Main: 3 Pages
5 Figures
7 Tables
Appendix: 22 Pages
Abstract

While Reinforcement Learning (RL) shows promise in training tool-use Large Language Models (LLMs) using verifiable outcome rewards, existing methods largely overlook the potential of reasoning rewards based on chain-of-thought quality for better tool utilization. Furthermore, naïvely combining reasoning and outcome rewards may yield suboptimal performance or conflict with the primary optimization objective. To address this, we propose Advantage-Weighted Policy Optimization (AWPO), a principled RL framework that adaptively integrates reasoning rewards into advantage estimation to improve tool-use performance. AWPO incorporates variance-aware gating and difficulty-aware weighting to adaptively modulate advantages from reasoning signals based on group-relative statistics, alongside a tailored clipping mechanism for stable optimization. Extensive experiments demonstrate that AWPO achieves state-of-the-art performance across standard tool-use benchmarks, significantly outperforming strong baselines and leading closed-source models in challenging multi-turn scenarios. Notably, with exceptional parameter efficiency, our 4B model surpasses Grok-4 by 16.0% in multi-turn accuracy while preserving generalization capability on the out-of-distribution MMLU-Pro benchmark.
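To make the abstract's mechanism concrete, the sketch below shows one plausible reading of how group-relative outcome advantages could be blended with reasoning-reward advantages via variance-aware gating and difficulty-aware weighting, plus clipping. This is a hypothetical illustration, not the paper's implementation: the function name, the thresholds, and the exact gating and weighting formulas are all assumptions.

```python
import numpy as np

def awpo_advantages(outcome_rewards, reasoning_rewards,
                    gate_var_threshold=0.5, clip_range=2.0):
    """Hypothetical AWPO-style advantage shaping for one rollout group
    (all sampled responses to the same prompt). Not the paper's code."""
    r_out = np.asarray(outcome_rewards, dtype=float)
    r_rsn = np.asarray(reasoning_rewards, dtype=float)

    # Group-relative (GRPO-style) normalized advantages from outcome rewards.
    adv_out = (r_out - r_out.mean()) / (r_out.std() + 1e-8)
    adv_rsn = (r_rsn - r_rsn.mean()) / (r_rsn.std() + 1e-8)

    # Variance-aware gating (assumed form): inject the reasoning signal only
    # when outcome rewards barely discriminate between responses.
    gate = 1.0 if r_out.std() < gate_var_threshold else 0.0

    # Difficulty-aware weighting (assumed form): harder prompts, i.e. low
    # mean success, lean more on the reasoning reward (rewards in [0, 1]).
    difficulty = 1.0 - r_out.mean()
    weight = gate * difficulty

    # Tailored clipping of the blended advantage for stable optimization.
    return np.clip(adv_out + weight * adv_rsn, -clip_range, clip_range)
```

Under this sketch, a group where every rollout fails (zero outcome variance) falls back on reasoning quality to rank responses, while a group with clearly separated outcomes ignores the reasoning signal entirely.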
