On the Approximation of Cooperative Heterogeneous Multi-Agent Reinforcement Learning (MARL) using Mean Field Control (MFC)

9 September 2021
Washim Uddin Mondal
Mridul Agarwal
Vaneet Aggarwal
S. Ukkusuri
arXiv:2109.04024
Abstract

Mean field control (MFC) is an effective way to mitigate the curse of dimensionality of cooperative multi-agent reinforcement learning (MARL) problems. This work considers a collection of $N_{\mathrm{pop}}$ heterogeneous agents that can be segregated into $K$ classes such that the $k$-th class contains $N_k$ homogeneous agents. We aim to prove approximation guarantees of the MARL problem for this heterogeneous system by its corresponding MFC problem. We consider three scenarios where the reward and transition dynamics of all agents are respectively taken to be functions of $(1)$ joint state and action distributions across all classes, $(2)$ individual distributions of each class, and $(3)$ marginal distributions of the entire population. We show that, in these cases, the $K$-class MARL problem can be approximated by MFC with errors given as $e_1=\mathcal{O}\left(\frac{\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}}{N_{\mathrm{pop}}}\sum_{k}\sqrt{N_k}\right)$, $e_2=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\sum_{k}\frac{1}{\sqrt{N_k}}\right)$, and $e_3=\mathcal{O}\left(\left[\sqrt{|\mathcal{X}|}+\sqrt{|\mathcal{U}|}\right]\left[\frac{A}{N_{\mathrm{pop}}}\sum_{k\in[K]}\sqrt{N_k}+\frac{B}{\sqrt{N_{\mathrm{pop}}}}\right]\right)$, respectively, where $A, B$ are some constants and $|\mathcal{X}|, |\mathcal{U}|$ are the sizes of the state and action spaces of each agent. Finally, we design a Natural Policy Gradient (NPG) based algorithm that, in the three cases stated above, can converge to an optimal MARL policy within $\mathcal{O}(e_j)$ error with a sample complexity of $\mathcal{O}(e_j^{-3})$, $j\in\{1,2,3\}$, respectively.
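As a rough illustration of the mean-field viewpoint (not the paper's algorithm), the sketch below evolves per-class state distributions under class-wise policies when the transition kernel depends on the joint state-action distribution across all classes, as in scenario $(1)$ of the abstract. The function name, array shapes, and transition interface are assumptions introduced for this example only.

```python
# Minimal illustrative sketch of one deterministic mean-field update for a
# K-class population (assumed formulation, not taken from the paper).
import numpy as np

def mfc_step(mu, pi, transition):
    """One mean-field update.

    mu         : array (K, X)      per-class state distributions
    pi         : array (K, X, U)   per-class policies; each row sums to 1
    transition : callable(x, u, nu) -> array (X,)
                 next-state distribution for an agent in state x taking
                 action u, given the joint state-action distribution nu
                 of shape (K, X, U). This interface is an assumption.
    """
    K, X = mu.shape
    U = pi.shape[2]
    # Joint state-action distribution induced by (mu, pi) for each class.
    nu = mu[:, :, None] * pi                      # shape (K, X, U)
    next_mu = np.zeros_like(mu)
    for k in range(K):
        for x in range(X):
            for u in range(U):
                # Mass at (x, u) in class k spreads according to the
                # mean-field transition kernel.
                next_mu[k] += nu[k, x, u] * transition(x, u, nu)
    return next_mu
```

In this picture, as the class sizes $N_k$ grow, the empirical class-wise distributions of the finite-agent system concentrate around such deterministic mean-field trajectories; the abstract's bounds $e_1$, $e_2$, $e_3$ quantify how closely the MFC solution approximates the original $K$-class MARL problem in the three scenarios.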
