Noisy-MAPPO: Noisy Advantage Values for Cooperative Multi-agent Actor-Critic methods

Abstract

Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs through its successful application to cooperative multi-agent tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent works have applied Proximal Policy Optimization (PPO) to multi-agent tasks, yielding Multi-agent PPO (MAPPO). However, MAPPO in current works lacks theoretical support and requires hand-crafted agent-specific features, a variant called MAPPO-agent-specific (MAPPO-AS). In addition, the performance of MAPPO-AS is still lower than that of finetuned QMIX on the popular benchmark environment StarCraft Multi-agent Challenge (SMAC). In this paper, we first theoretically generalize single-agent PPO to vanilla MAPPO, showing that vanilla MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Secondly, since the centralized advantage function in vanilla MAPPO lacks a credit assignment mechanism, the policies of some agents may be updated in a suboptimal direction, which in turn prevents the agents from exploring better trajectories; we call this problem \textit{The Policies Overfitting in Multi-agent Cooperation (POMAC)}. To address POMAC, we propose Noisy Advantage Values (Noisy-MAPPO and Advantage-Noisy-MAPPO), which smooth the advantage values in a manner similar to label smoothing. The experimental results show that the average performance of Noisy-MAPPO is better than that of finetuned QMIX and MAPPO-AS, and much better than that of vanilla MAPPO. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
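
The sketch below illustrates the core idea described in the abstract: perturbing the shared, centralized advantage values with per-agent Gaussian noise before the standard PPO clipped-surrogate update. This is a minimal PyTorch illustration under assumed shapes and a hypothetical `noise_std` parameter, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the Noisy-MAPPO idea: add zero-mean Gaussian noise to the
# centralized advantages (one sample per agent) so that no single agent's
# policy overfits the shared advantage signal, then apply the usual PPO
# clipped-surrogate loss. Names and shapes here are illustrative assumptions.
import torch


def noisy_advantages(advantages: torch.Tensor, noise_std: float = 0.5) -> torch.Tensor:
    """Smooth shared advantages with per-agent Gaussian noise.

    advantages: tensor of shape (batch, n_agents), the centralized advantage
    broadcast to every agent.
    """
    noise = torch.randn_like(advantages) * noise_std  # hypothetical noise scale
    return advantages + noise


def ppo_clip_loss(log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective using the (noisy) advantages."""
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()


# Usage: perturb the shared advantages, then run the per-agent PPO update.
adv = torch.randn(128, 5)                         # (batch, n_agents) dummy advantages
logp = torch.randn(128, 5, requires_grad=True)    # dummy new-policy log-probs
old_logp = logp.detach() + 0.01 * torch.randn(128, 5)
loss = ppo_clip_loss(logp, old_logp, noisy_advantages(adv))
loss.backward()
```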
