Policy Regularization with Noisy Advantage Values for Cooperative
Multi-agent Actor-Critic methods

Multi-Agent Reinforcement Learning (MARL) has seen revolutionary breakthroughs with its successful application to multi-agent cooperative tasks such as robot swarm control, autonomous vehicle coordination, and computer games. Recent works have applied Proximal Policy Optimization (PPO) to multi-agent tasks, yielding Multi-agent PPO (MAPPO). However, previous literature shows that vanilla MAPPO with a shared value function may not perform as well as Independent PPO (IPPO) and finetuned QMIX; MAPPO-agent-specific (MAPPO-AS) therefore improves upon vanilla MAPPO and IPPO through handcrafted agent-specific features. In addition, no existing literature provides a theoretical analysis of the working mechanism of MAPPO. In this paper, we first theoretically generalize single-agent PPO to vanilla MAPPO, showing that vanilla MAPPO is approximately equivalent to optimizing a multi-agent joint policy with the original PPO. Secondly, we find that vanilla MAPPO suffers from \textit{Policy Overfitting in Multi-agent Cooperation (POMAC)}, since the agents learn their policies from sampled centralized advantage values. POMAC may then cause the policies of some agents to be updated in a suboptimal direction and prevent the agents from exploring better trajectories. To address POMAC, we propose novel policy regularization methods, i.e., Noisy-MAPPO and Advantage-Noisy-MAPPO, which smooth the advantage values with noise. The experimental results show that the average performance of Noisy-MAPPO is better than that of finetuned QMIX and MAPPO-AS, and much better than that of vanilla MAPPO. We open-source the code at \url{https://github.com/hijkzzz/noisy-mappo}.
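The core idea described above is to perturb the shared centralized advantage values with noise before each agent's PPO update, so that the agents do not all overfit to the same sampled advantages. The sketch below illustrates one plausible form of this in PyTorch; the function name `noisy_clipped_ppo_loss`, the per-agent Gaussian noise, and the hyperparameter `noise_std` are illustrative assumptions, not the paper's reference implementation (see the linked repository for the actual code).

```python
# Minimal sketch, assuming per-agent Gaussian noise on a shared advantage.
import torch

def noisy_clipped_ppo_loss(log_probs, old_log_probs, advantages,
                           clip_eps=0.2, noise_std=0.1):
    """PPO clipped surrogate loss with noise-smoothed advantages.

    log_probs, old_log_probs: (batch, n_agents) per-agent action log-probs.
    advantages: (batch, 1) centralized advantage estimates shared by agents.
    """
    # Perturb the shared advantage independently for each agent (assumed form).
    noise = torch.randn(advantages.shape[0], log_probs.shape[1]) * noise_std
    noisy_adv = advantages + noise  # broadcast (batch, 1) -> (batch, n_agents)

    # Standard PPO clipped surrogate, evaluated with the noisy advantages.
    ratio = torch.exp(log_probs - old_log_probs)
    unclipped = ratio * noisy_adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * noisy_adv
    return -torch.min(unclipped, clipped).mean()
```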