Optimistic Q-learning for average reward and episodic reinforcement learning

18 July 2024
Priyank Agrawal
Shipra Agrawal
Abstract

We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time \textit{for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\overline{L}$ operator defined as $\overline{L} v = \frac{1}{H} \sum_{h=1}^H L^h v$, where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\overline{L}$ operator has a strict contraction (in span) even in the average-reward setting where the discount factor is $1$. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.
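To make the averaged operator concrete, the following is a minimal, model-based sketch (not the authors' model-free optimistic Q-learning algorithm): it applies the optimal Bellman operator $L$ to a value vector $H$ times on a small, hypothetical random MDP, averages the iterates to form $\overline{L} v = \frac{1}{H}\sum_{h=1}^H L^h v$, and numerically checks the span-contraction property the abstract describes. The MDP construction, function names, and parameter values are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch of the averaged Bellman operator \bar{L} on a
# hypothetical tabular MDP, plus a numerical span-contraction check.
# This is NOT the paper's algorithm (which is sample-based and optimistic);
# it only demonstrates the operator the abstract defines.
import numpy as np


def bellman_operator(v, P, r):
    """One step of the undiscounted optimal Bellman operator L:
    (Lv)(s) = max_a [ r(s,a) + sum_{s'} P(s'|s,a) v(s') ].
    P has shape (S, A, S); r has shape (S, A); v has shape (S,)."""
    return np.max(r + P @ v, axis=1)


def averaged_operator(v, P, r, H):
    """Averaged operator \bar{L} v = (1/H) * sum_{h=1}^H L^h v."""
    total = np.zeros_like(v)
    w = v
    for _ in range(H):
        w = bellman_operator(w, P, r)
        total += w
    return total / H


def span(v):
    """Span seminorm: max(v) - min(v)."""
    return np.max(v) - np.min(v)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A, H = 5, 3, 10  # hypothetical sizes, chosen only for illustration

    # Random MDP in which state 0 plays the role of the frequent state s_0:
    # every (s, a) pair jumps to state 0 with probability at least 0.2,
    # so the time to visit s_0 is bounded in the sense the abstract assumes.
    P = rng.dirichlet(np.ones(S), size=(S, A))
    P = 0.8 * P + 0.2 * np.eye(S)[0]
    r = rng.uniform(0.0, 1.0, size=(S, A))

    # For two arbitrary value vectors, the span of the difference should
    # shrink under \bar{L} (strict contraction in span for this construction).
    v1, v2 = rng.normal(size=S), rng.normal(size=S)
    ratio = span(averaged_operator(v1, P, r, H) - averaged_operator(v2, P, r, H)) / span(v1 - v2)
    print(f"span contraction ratio: {ratio:.3f}  (expected to be < 1 here)")
```

In this constructed example the one-step operator already contracts in span because every state-action pair reaches $s_0$ with constant probability; the averaged operator inherits that property, which is the weaker, policy-level condition the paper actually exploits.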

@article{agrawal2025_2407.13743,
  title={Optimistic Q-learning for average reward and episodic reinforcement learning},
  author={Priyank Agrawal and Shipra Agrawal},
  journal={arXiv preprint arXiv:2407.13743},
  year={2025}
}