Reducing Planning Complexity of General Reinforcement Learning with Non-Markovian Abstractions

26 December 2021
Sultan Javed Majeed
Marcus Hutter
    OffRL
Abstract

The field of General Reinforcement Learning (GRL) formulates the problem of sequential decision-making from the ground up. The history of interaction constitutes a "ground" state of the system, which never repeats. On the one hand, this generality allows GRL to model almost every possible domain, e.g. bandits, MDPs, POMDPs, PSRs, and history-based environments. On the other hand, near-optimal policies in GRL are in general functions of the complete history, which hinders not only learning but also planning. The usual workaround for the planning part is to give the agent a Markovian abstraction of the underlying process, so that it can use any MDP planning algorithm to find a near-optimal policy. The Extreme State Aggregation (ESA) framework has extended this idea to non-Markovian abstractions without compromising the possibility of planning through a (surrogate) MDP. A distinguishing feature of ESA is that it proves an upper bound of $O\left(\varepsilon^{-A} \cdot (1-\gamma)^{-2A}\right)$ on the number of states required for the surrogate MDP (where $A$ is the number of actions, $\gamma$ is the discount factor, and $\varepsilon$ is the optimality gap) which holds \emph{uniformly} for \emph{all} domains. While the possibility of a universal bound is quite remarkable, we show that this bound is very loose. We propose a novel non-MDP abstraction which allows for a much better upper bound of $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot A \cdot 2^{A}\right)$. Furthermore, we show that this bound can be improved further to $O\left(\varepsilon^{-1} \cdot (1-\gamma)^{-2} \cdot \log^3 A\right)$ by using an action-sequentialization method.
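To give a rough sense of the gap between these bounds, the short Python sketch below evaluates the three expressions from the abstract (ignoring any hidden constants in the $O(\cdot)$ notation) for illustrative values of $\varepsilon$, $\gamma$, and $A$. The chosen numbers and variable names are assumptions for illustration only, not taken from the paper.

```python
import math

# Illustrative (assumed) values: optimality gap, discount factor, number of actions.
eps, gamma, A = 0.1, 0.9, 4

# ESA bound: O(eps^-A * (1-gamma)^(-2A))
esa_bound = eps ** (-A) * (1 - gamma) ** (-2 * A)

# Proposed non-MDP abstraction bound: O(eps^-1 * (1-gamma)^-2 * A * 2^A)
new_bound = eps ** (-1) * (1 - gamma) ** (-2) * A * 2 ** A

# Action-sequentialized bound: O(eps^-1 * (1-gamma)^-2 * log^3 A)
seq_bound = eps ** (-1) * (1 - gamma) ** (-2) * math.log(A) ** 3

print(f"ESA bound:               {esa_bound:.3e}")  # ~1e+12
print(f"Proposed non-MDP bound:  {new_bound:.3e}")  # ~6.4e+4
print(f"Action-sequentialized:   {seq_bound:.3e}")  # ~2.7e+3
```

Even for this modest setting ($A = 4$, $\gamma = 0.9$, $\varepsilon = 0.1$), the exponential dependence on $A$ in the ESA bound dominates, which is consistent with the abstract's claim that the universal ESA bound is very loose compared to the proposed abstractions.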
