Adaptive Group Policy Optimization: Towards Stable Training and Token-Efficient Reasoning

Abstract
Since being popularized by DeepSeek-R1, Group Relative Policy Optimization (GRPO) has become a core component of training reasoning LLMs. However, we identify deficiencies that affect RL training stability and inference efficiency. We therefore propose Adaptive Group Policy Optimization (AGPO), which contains two simple but effective modifications: a revised advantage estimation method that mitigates zero-variance situations, and a length-based reward that incentivizes the model to avoid overthinking. Experiments demonstrate that our methods achieve more stable training and comparable or superior performance while using significantly fewer tokens in the reasoning steps.
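The abstract does not spell out the formulas, so the sketch below is only an illustration: it shows the standard GRPO group-normalized advantage, where a group in which every sampled completion gets the same reward (a zero-variance group) yields no learning signal, together with one hypothetical form of a length-based reward penalty. The function names, the `penalty_scale` parameter, and the linear penalty shape are assumptions for illustration, not the paper's definitions.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """Standard GRPO-style group-normalized advantages.

    If every completion in a group receives the same reward (zero variance),
    the numerator is zero for all samples, so the group contributes no
    gradient signal -- the situation AGPO's revised estimator targets.
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def length_shaped_reward(base_reward, length, max_length, penalty_scale=0.1):
    """Hypothetical length-based shaping (not the paper's exact reward):
    subtract a penalty that grows with response length to discourage
    overthinking while preserving the task reward.
    """
    return base_reward - penalty_scale * (length / max_length)
```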