A Sharp Analysis of Model-based Reinforcement Learning with Self-Play

International Conference on Machine Learning (ICML), 2020
4 October 2020
Qinghua Liu, Tiancheng Yu, Yu Bai, Chi Jin
arXiv:2010.01604
Abstract

Model-based algorithms -- algorithms that explore the environment through building and utilizing an estimated model -- are widely used in reinforcement learning practice and theoretically shown to achieve optimal sample efficiency for single-agent reinforcement learning in Markov Decision Processes (MDPs). However, for multi-agent reinforcement learning in Markov games, the current best known sample complexity for model-based algorithms is rather suboptimal and compares unfavorably against recent model-free approaches. In this paper, we present a sharp analysis of model-based self-play algorithms for multi-agent Markov games. We design an algorithm -- Optimistic Nash Value Iteration (Nash-VI) -- for two-player zero-sum Markov games that is able to output an $\epsilon$-approximate Nash policy in $\tilde{\mathcal{O}}(H^3 S A B/\epsilon^2)$ episodes of game playing, where $S$ is the number of states, $A,B$ are the numbers of actions for the two players respectively, and $H$ is the horizon length. This significantly improves over the best known model-based guarantee of $\tilde{\mathcal{O}}(H^4 S^2 A B/\epsilon^2)$, and is the first that matches the information-theoretic lower bound $\Omega(H^3 S(A+B)/\epsilon^2)$ except for a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against the best known model-free algorithm if $\min\{A,B\} = o(H^3)$, and outputs a single Markov policy, while existing sample-efficient model-free algorithms output a nested mixture of Markov policies that is in general non-Markov and rather inconvenient to store and execute. We further adapt our analysis to designing a provably efficient task-agnostic algorithm for zero-sum Markov games, and designing the first line of provably sample-efficient algorithms for multi-player general-sum Markov games.
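For readers who want a concrete picture of the model-based self-play loop the abstract describes, below is a minimal, illustrative Python sketch, not the paper's exact Nash-VI: it builds an empirical model from counts, adds a crude optimism bonus, solves each stage's zero-sum matrix game by linear programming, and rolls out the resulting Markov policy. The `sample_reward` and `sample_transition` callables, the $1/\sqrt{n}$ bonus, the uniform opponent during data collection, and the omission of the paper's lower confidence bounds and certified duality-gap output are all simplifying assumptions.

```python
# A rough sketch (not the paper's exact Nash-VI) of optimistic value iteration
# with an empirical model for a tabular two-player zero-sum Markov game.
# Assumptions: rewards in [0, 1]; a naive 1/sqrt(n) bonus (the paper's is
# sharper); only the max player's upper-bound planning is shown; the min
# player acts uniformly during data collection; `sample_reward` and
# `sample_transition` are hypothetical simulator callables.
import numpy as np
from scipy.optimize import linprog


def matrix_game_value(Q):
    """Nash value and max-player strategy of the zero-sum matrix game Q (A x B)."""
    A, B = Q.shape
    # Variables z = (x_1, ..., x_A, v); maximize v  <=>  minimize -v.
    c = np.zeros(A + 1)
    c[-1] = -1.0
    # One constraint per column b:  v - sum_a x_a * Q[a, b] <= 0.
    A_ub = np.hstack([-Q.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    A_eq = np.zeros((1, A + 1))
    A_eq[0, :A] = 1.0  # the mixed strategy sums to one
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.array([1.0]),
                  bounds=[(0, None)] * A + [(None, None)])
    x = np.clip(res.x[:A], 0.0, None)
    return -res.fun, x / x.sum()


def optimistic_nash_vi(S, A, B, H, sample_reward, sample_transition, episodes=200):
    counts = np.zeros((H, S, A, B))
    next_counts = np.zeros((H, S, A, B, S))
    reward_sum = np.zeros((H, S, A, B))
    policy = np.zeros((H, S, A))  # the max player's Markov policy
    for _ in range(episodes):
        # Plan: backward value iteration on the empirical model plus a bonus.
        V_up = np.zeros((H + 1, S))
        for h in reversed(range(H)):
            n = np.maximum(counts[h], 1.0)
            r_hat = reward_sum[h] / n
            p_hat = next_counts[h] / n[..., None]
            bonus = np.sqrt(1.0 / n)  # crude optimism term
            Q_up = np.clip(r_hat + p_hat @ V_up[h + 1] + bonus, 0.0, float(H))
            for s in range(S):
                V_up[h, s], policy[h, s] = matrix_game_value(Q_up[s])
        # Act: collect one episode and update the empirical model.
        s = 0
        for h in range(H):
            a = np.random.choice(A, p=policy[h, s])
            b = np.random.randint(B)  # placeholder opponent
            r = sample_reward(h, s, a, b)
            s_next = sample_transition(h, s, a, b)
            counts[h, s, a, b] += 1
            next_counts[h, s, a, b, s_next] += 1
            reward_sum[h, s, a, b] += r
            s = s_next
    return policy
```

Solving each stage game exactly and acting on the result keeps the output a single Markov policy, which is the storage and execution advantage the abstract highlights over the nested mixtures produced by existing sample-efficient model-free methods.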
