Noisy-MAPPO: Noisy Advantage Values for Cooperative Multi-agent
Actor-Critic methods
Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs with its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent works have applied Proximal Policy Optimization (PPO) to multi-agent tasks, yielding Multi-agent PPO (MAPPO). However, MAPPO as used in current works lacks theoretical support for its convergence and requires hand-crafted agent-specific features, a variant called MAPPO-agent-specific (MAPPO-AS). In addition, the performance of MAPPO-AS is still lower than that of finetuned QMIX on the popular benchmark environment StarCraft Multi-agent Challenge (SMAC). In this paper, we first theoretically generalize single-agent PPO to MAPPO, which provides theoretical support for MAPPO. Second, we observe that the sampled advantage values in vanilla MAPPO may mislead the learning of agents that are unrelated to these advantage values; we call this \textit{the Policies Overfitting in Multi-agent Cooperation (POMAC)} problem, and propose Noisy Advantage Values (Noisy-MAPPO and Advantage-Noisy-MAPPO) to solve it. The experimental results show that the average performance of Noisy-MAPPO is better than that of finetuned QMIX, and that Noisy-MAPPO is the first algorithm to achieve winning rates above 90\% in all SMAC scenarios. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
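To make the core idea concrete, the following is a minimal sketch of a PPO clipped policy loss in which the shared advantage values are perturbed independently per agent before the update, in the spirit of Noisy Advantage Values. The Gaussian noise form, the `noise_std` parameter, and all function and variable names here are illustrative assumptions, not the paper's actual implementation (see the linked repository for that).

```python
import numpy as np

def noisy_mappo_policy_loss(ratios, advantages, noise_std=0.1, clip_eps=0.2, rng=None):
    """Illustrative PPO clipped surrogate loss with per-agent noise added to the
    shared advantage values (a sketch of the Noisy-MAPPO idea; the Gaussian noise
    and noise_std are assumptions, not details taken from the abstract).

    ratios:     (batch, n_agents) importance-sampling ratios pi_new / pi_old
    advantages: (batch, 1) shared advantage values from the centralized critic
    """
    rng = np.random.default_rng() if rng is None else rng
    n_agents = ratios.shape[1]
    # Perturb the shared advantage independently for each agent, so that one
    # agent's advantage signal does not uniformly drive every agent's update.
    noise = rng.normal(0.0, noise_std, size=(advantages.shape[0], n_agents))
    noisy_adv = advantages + noise
    # Standard PPO clipped surrogate objective, applied per agent.
    unclipped = ratios * noisy_adv
    clipped = np.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * noisy_adv
    return -np.mean(np.minimum(unclipped, clipped))

# Example usage with random data: 64 samples, 3 agents.
rng = np.random.default_rng(0)
ratios = np.exp(rng.normal(0.0, 0.05, size=(64, 3)))
advantages = rng.normal(0.0, 1.0, size=(64, 1))
print(noisy_mappo_policy_loss(ratios, advantages, rng=rng))
```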