
Scalable Primal-Dual Actor-Critic Method for Safe Multi-Agent RL with General Utilities

Abstract

We investigate safe multi-agent reinforcement learning, where agents seek to collectively maximize the sum of local objectives while satisfying their own safety constraints. The objective and constraints are described by general utilities, i.e., nonlinear functions of the long-term state-action occupancy measure, which encompass broader decision-making goals such as risk, exploration, or imitation. The exponential growth of the state-action space with the number of agents makes global observability impractical, a difficulty further exacerbated by the global coupling introduced by the agents' safety constraints. To tackle this issue, we propose a primal-dual method that uses shadow rewards and $\kappa$-hop neighbor truncation under a form of correlation decay, where $\kappa$ is the communication radius. In the exact setting, our algorithm converges to a first-order stationary point (FOSP) at a rate of $\mathcal{O}\left(T^{-2/3}\right)$. In the sample-based setting, we show that, with high probability, our algorithm requires $\widetilde{\mathcal{O}}\left(\epsilon^{-3.5}\right)$ samples to achieve an $\epsilon$-FOSP with an approximation error of $\mathcal{O}(\phi_0^{2\kappa})$, where $\phi_0\in (0,1)$. Finally, we demonstrate the effectiveness of our method through extensive numerical experiments.
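A minimal sketch of the problem class the abstract describes, under assumed notation that is not taken from the paper: $\lambda^{\pi}$ denotes the long-term state-action occupancy measure induced by the joint policy $\pi$, and $f_i$, $g_i$ denote agent $i$'s local general-utility objective and safety constraint.

$$
\max_{\pi}\; \sum_{i=1}^{n} f_i\!\left(\lambda^{\pi}\right)
\quad \text{s.t.} \quad g_i\!\left(\lambda^{\pi}\right) \le 0, \quad i = 1,\dots,n,
$$

Here each $f_i$ and $g_i$ may be a nonlinear function of $\lambda^{\pi}$; the standard cumulative-reward setting is recovered when both are linear in the occupancy measure.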
