Stronger Together: On-Policy Reinforcement Learning for Collaborative LLMs

Main: 9 pages · Appendix: 11 pages · Bibliography: 4 pages · 5 figures · 3 tables
Abstract
Multi-agent systems (MAS) and reinforcement learning (RL) are widely used to enhance the agentic capabilities of large language models (LLMs). MAS improves task performance through role-based orchestration, while RL leverages environmental rewards to learn stronger policies, for example via GRPO-style optimization. However, applying on-policy RL to MAS remains underexplored and presents unique challenges. Algorithmically, the standard GRPO grouping assumption breaks down because prompts vary by role and by turn. On the systems side, the training stack must support MAS-workflow rollouts and on-policy updates for both single-policy and multi-policy models.
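For context on the grouping assumption, a minimal sketch in standard GRPO notation (the symbols q, o_i, r_i, and G follow the common GRPO formulation, not this paper): given a single prompt q, GRPO samples G completions o_1, ..., o_G with rewards r_1, ..., r_G and computes the group-relative advantage

    \hat{A}_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)},

which presumes every rollout in the group shares the same prompt q. In a MAS rollout, each agent's prompt differs by role and by conversation turn, so no single shared q exists and this within-group normalization is no longer directly applicable.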
