Centralized Adaptive Sampling for Reliable Co-Training of Independent Multi-Agent Policies

1 August 2025
Nicholas Corrado
Josiah P. Hanna
arXiv:2508.01049
Main: 10 pages · Appendix: 13 pages · Bibliography: 4 pages · 14 figures · 3 tables
Abstract

Independent on-policy policy gradient algorithms are widely used for multi-agent reinforcement learning (MARL) in cooperative and no-conflict games, but they are known to converge suboptimally when each agent's policy gradient points toward a suboptimal equilibrium. In this work, we identify a subtler failure mode that arises even when the expected policy gradients of all agents point toward an optimal solution. After collecting a finite set of trajectories, stochasticity in independent action sampling can cause the joint data distribution to deviate from the expected joint on-policy distribution. This sampling error with respect to the joint on-policy distribution produces inaccurate gradient estimates that can lead agents to converge suboptimally. In this paper, we investigate whether joint sampling error can be reduced through coordinated action selection and whether doing so improves the reliability of policy gradient learning in MARL. Toward this end, we introduce an adaptive action sampling approach to reduce joint sampling error. Our method, Multi-Agent Proximal Robust On-Policy Sampling (MA-PROPS), uses a centralized behavior policy that we continually adapt to place larger probability on joint actions that are currently under-sampled with respect to the current joint policy. We empirically evaluate MA-PROPS on a diverse range of multi-agent games and demonstrate that (1) MA-PROPS reduces joint sampling error more efficiently than standard on-policy sampling and (2) MA-PROPS improves the reliability of independent policy gradient algorithms, increasing the fraction of training runs that converge to an optimal joint policy.
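To make the core idea concrete, below is a minimal, hypothetical sketch of centralized adaptive sampling for a one-step two-agent matrix game with discrete actions. It is not the authors' implementation: MA-PROPS adapts a parameterized behavior policy with a proximal (clipped) update, whereas this sketch uses a simple count-based reweighting to place more probability on joint actions that are under-sampled relative to the product of the agents' current policies. All names here (behavior_probs, deficit, temperature) are illustrative assumptions, not from the paper.

```python
import numpy as np

def behavior_probs(pi1, pi2, counts, n, temperature=1.0):
    """Return a behavior distribution over joint actions.

    pi1, pi2: per-agent policy probabilities, shape (K,)
    counts:   empirical counts of sampled joint actions, shape (K, K)
    n:        total number of joint actions sampled so far
    """
    target = np.outer(pi1, pi2)            # joint on-policy distribution
    empirical = counts / max(n, 1)         # empirical joint distribution
    deficit = target - empirical           # positive where under-sampled
    # Boost under-sampled joint actions, staying on the support of target.
    logits = np.log(target + 1e-12) + temperature * deficit / (target + 1e-12)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()

rng = np.random.default_rng(0)
K = 3
pi1 = np.full(K, 1.0 / K)                  # uniform per-agent policies
pi2 = np.full(K, 1.0 / K)
counts = np.zeros((K, K))
N = 200

for t in range(N):
    probs = behavior_probs(pi1, pi2, counts, t)
    a = rng.choice(K * K, p=probs.ravel()) # centralized joint-action sample
    counts[a // K, a % K] += 1

# Joint sampling error: total variation distance between the empirical
# joint distribution and the expected joint on-policy distribution.
target = np.outer(pi1, pi2)
tv = 0.5 * np.abs(counts / N - target).sum()
print(f"TV distance to joint on-policy distribution: {tv:.4f}")
```

The intent of the reweighting is that the empirical joint distribution tracks the joint on-policy distribution more closely than i.i.d. independent sampling would at the same sample size, which is the quantity the paper reports as reduced joint sampling error; the actual method achieves this with a learned behavior policy rather than explicit counts.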
