Near-Optimal Randomized Exploration for Tabular Markov Decision Processes

19 February 2021
Zhihan Xiong
Ruoqi Shen
Qiwen Cui
Maryam Fazel
S. Du
arXiv:2102.09703
Abstract

We study algorithms that use randomized value functions for exploration in reinforcement learning. This type of algorithm enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}(H\sqrt{SAT})$ regret bound for episodic time-inhomogeneous Markov Decision Processes, where $S$ is the size of the state space, $A$ is the size of the action space, $H$ is the planning horizon, and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions and, for the first time, matches the $\Omega(H\sqrt{SAT})$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure that both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
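The abstract's core ingredients, randomized value functions driven by a single per-episode random seed, a Bernstein-type noise magnitude, and a clipping operation, can be illustrated with a toy sketch. The Python code below is only a schematic illustration of those ideas, not the paper's algorithm: the MDP, the noise scale (a simplified variance-plus-range term), the clipping range $[0, H-h]$, and the greedy action rule are all hypothetical, simplified choices made for the example.

```python
import numpy as np

S, A, H = 5, 3, 4          # toy sizes: states, actions, horizon
rng = np.random.default_rng(0)

# Hypothetical toy MDP, fixed for the whole run (unknown to the agent).
P = rng.dirichlet(np.ones(S), size=(H, S, A))   # P[h, s, a] = distribution over next states
R = rng.uniform(0.0, 1.0, size=(H, S, A))       # mean rewards in [0, 1]

# Empirical model maintained by the agent.
N = np.ones((H, S, A))                   # visit counts (start at 1 to avoid division by zero)
P_hat = np.full((H, S, A, S), 1.0 / S)   # empirical transition estimates
R_hat = np.zeros((H, S, A))              # empirical mean-reward estimates

def randomized_q(episode_seed):
    """Backward induction on the empirical model with Gaussian perturbations.
    A single seed drives all the noise in the episode (ingredient 1 in the abstract)."""
    ep_rng = np.random.default_rng(episode_seed)
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))             # V[H] = 0 at the terminal step
    for h in range(H - 1, -1, -1):
        expected_next = P_hat[h] @ V[h + 1]                       # shape (S, A)
        # Simplified Bernstein-style scale: sqrt(variance of next value / n) + range / n.
        var = (P_hat[h] * (V[h + 1][None, None, :] - expected_next[..., None]) ** 2).sum(-1)
        sigma = np.sqrt(var / N[h]) + H / N[h]
        Q[h] = R_hat[h] + expected_next + ep_rng.standard_normal((S, A)) * sigma
        # Schematic clipping: keep randomized values inside the trivial range [0, H - h].
        Q[h] = np.clip(Q[h], 0.0, H - h)
        V[h] = Q[h].max(axis=1)
    return Q

for episode in range(50):
    Q = randomized_q(episode_seed=episode)
    s = 0
    for h in range(H):
        a = int(Q[h, s].argmax())                       # act greedily w.r.t. randomized Q
        s_next = rng.choice(S, p=P[h, s, a])
        r = R[h, s, a]
        # Incremental updates of the empirical model.
        N[h, s, a] += 1
        R_hat[h, s, a] += (r - R_hat[h, s, a]) / N[h, s, a]
        P_hat[h, s, a] += (np.eye(S)[s_next] - P_hat[h, s, a]) / N[h, s, a]
        s = s_next
```

Because one seed generates every perturbation in an episode, the noise is correlated across stages within that episode rather than redrawn at each step, which is the "single random seed" aspect the abstract highlights; the Bernstein-type and clipping details analyzed in the paper are only gestured at here.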
