SHAQ: Incorporating Shapley Value Theory into Multi-Agent Q-Learning
- OffRL, FAtt
Value factorisation proves to be a useful technique for multi-agent reinforcement learning (MARL) in the global reward game, but its underlying mechanism is not yet fully understood. This paper explores a theoretical framework for value factorisation with interpretability through Shapley value theory. We generalise the Shapley value to the Markov convex game, calling the result the Markov Shapley value, and apply it as a value factorisation method in the global reward game, thanks to the equivalence between these two games. Based on the properties of the Markov Shapley value, we derive the Shapley-Bellman optimality equation for evaluating the optimal Markov Shapley value that corresponds to the optimal joint deterministic policy. Furthermore, we propose the Shapley-Bellman operator, which is proved to solve the Shapley-Bellman optimality equation. Applying stochastic approximation and some transformations yields a new MARL algorithm called Shapley Q-learning (SHAQ), whose implementation is guided by the theoretical results on the Shapley-Bellman operator and the Markov Shapley value. In experiments, we show that SHAQ not only achieves superior performance on all tasks but also exhibits interpretability that agrees with the theoretical analysis of the Markov Shapley value and the Shapley-Bellman operator.
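The paper's Markov Shapley value and the SHAQ update rule are defined in the paper itself; as background, the sketch below only illustrates the classical Shapley value that the method generalises. It computes exact Shapley values for a small coalitional game by enumerating coalitions; the characteristic function `v` in the usage example is purely hypothetical toy data, not anything from the paper.

```python
from itertools import combinations
from math import factorial

def shapley_values(players, coalition_value):
    """Exact Shapley values for a small coalitional game.

    players         : list of hashable player ids.
    coalition_value : callable mapping a frozenset of players to a real
                      payoff (the characteristic function v).
    Returns a dict {player: Shapley value}.
    """
    n = len(players)
    values = {}
    for i in players:
        others = [p for p in players if p != i]
        phi = 0.0
        for k in range(n):  # size of the coalition S not containing i
            for subset in combinations(others, k):
                s = frozenset(subset)
                # weight |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                # marginal contribution of player i when joining coalition S
                phi += weight * (coalition_value(s | {i}) - coalition_value(s))
        values[i] = phi
    return values

# Toy 3-player game (illustrative numbers only): the grand coalition is
# worth 1.0, any pair is worth 0.5, singletons and the empty set are worth 0.
v = lambda s: 1.0 if len(s) == 3 else (0.5 if len(s) == 2 else 0.0)
print(shapley_values([0, 1, 2], v))  # symmetric game -> each player gets 1/3
```

By symmetry each player receives 1/3, and the values sum to the grand-coalition payoff, reflecting the efficiency property that motivates using Shapley-style credit assignment for value factorisation.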