We present an optimistic Q-learning algorithm for regret minimization in average reward reinforcement learning under an additional assumption on the underlying MDP that for all policies, the time to visit some frequent state $s_0$ is finite and upper bounded by $H$, either in expectation or with constant probability. Our setting strictly generalizes the episodic setting and is significantly less restrictive than the assumption of bounded hitting time \textit{for all states} made by most previous literature on model-free algorithms in average reward settings. We demonstrate a regret bound of $\tilde{O}(H^5 S\sqrt{AT})$, where $S$ and $A$ are the numbers of states and actions, and $T$ is the horizon. A key technical novelty of our work is the introduction of an $\mathbb{L}$ operator defined as $\mathbb{L}v = \frac{1}{H}\sum_{h=1}^{H} L^h v$, where $L$ denotes the Bellman operator. Under the given assumption, we show that the $\mathbb{L}$ operator is a strict contraction (in span) even in the average-reward setting, where the discount factor is $1$. Our algorithm design uses ideas from episodic Q-learning to estimate and apply this operator iteratively. Thus, we provide a unified view of regret minimization in episodic and non-episodic settings, which may be of independent interest.
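To make the averaged-operator idea concrete, the following is a minimal numerical sketch, not the paper's algorithm: it implements the undiscounted Bellman operator $L$ and the averaged operator $\frac{1}{H}\sum_{h=1}^{H} L^h$ on a small, randomly generated toy MDP in which one state is visited frequently under every action, and empirically checks that the averaged operator shrinks the span semi-norm. All sizes, names, and parameter values below are illustrative assumptions.

```python
import numpy as np

# Sketch: averaged Bellman operator  Lbar v = (1/H) * sum_{h=1}^{H} L^h v
# on a toy MDP, with an empirical check of span contraction.
# span(v) = max(v) - min(v).

S, A, H = 3, 2, 5          # toy sizes; H plays the role of the visit-time bound
rng = np.random.default_rng(0)

# Random rewards in [0, 1]; transitions biased so that every (state, action)
# pair has substantial probability of reaching state 0 (the "frequent state").
R = rng.uniform(size=(S, A))
P = rng.uniform(size=(S, A, S))
P[:, :, 0] += 2.0                     # bias probability mass toward state 0
P /= P.sum(axis=2, keepdims=True)     # normalize rows to a valid kernel

def bellman(v):
    """Undiscounted Bellman optimality operator: (L v)(s) = max_a [ r(s,a) + sum_s' P(s'|s,a) v(s') ]."""
    return np.max(R + P @ v, axis=1)

def averaged_bellman(v, H=H):
    """Averaged operator: (1/H) * sum_{h=1}^{H} L^h v."""
    out, w = np.zeros_like(v), v.copy()
    for _ in range(H):
        w = bellman(w)
        out += w
    return out / H

def span(v):
    return v.max() - v.min()

# Compare how the averaged operator moves two arbitrary value vectors closer in span.
u, v = rng.normal(size=S), rng.normal(size=S)
ratio = span(averaged_bellman(u) - averaged_bellman(v)) / span(u - v)
print(f"span contraction ratio on this toy instance: {ratio:.3f}")
```

On instances like this one, where the frequent state is reached quickly from everywhere, the printed ratio is strictly below 1, which is the qualitative behavior the abstract's contraction claim refers to; the paper's formal statement and contraction rate depend on $H$ and the stated assumption.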
@article{agrawal2025_2407.13743,
  title   = {Optimistic Q-learning for average reward and episodic reinforcement learning},
  author  = {Priyank Agrawal and Shipra Agrawal},
  journal = {arXiv preprint arXiv:2407.13743},
  year    = {2025}
}