Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning

Neural Information Processing Systems (NeurIPS), 2022
Abstract

In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent must commit in advance to a schedule of times at which the policy is updated; this setting is particularly suitable for scenarios where adaptively changing the policy is costly. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ (where $\tilde{O}(\cdot)$ hides logarithmic factors in $S$, $A$, $H$, and $K$) over $K$ episodes using $O(H+\log_2\log_2(K))$ batches, with confidence parameter $\delta$. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound achieved with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches must be at least $\Omega(H/\log_A(K)+\log_2\log_2(K))$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
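To give intuition for the $\log_2\log_2(K)$ term in the batch complexity, here is a minimal sketch of the standard geometric batch grid $t_i \approx K^{1-2^{-i}}$ used in batched bandit/RL analyses, which covers $K$ episodes with $O(\log_2\log_2 K)$ batches. This is an illustrative construction under that standard assumption, not the paper's exact schedule; the function name `batch_endpoints` is hypothetical.

```python
import math

def batch_endpoints(K: int) -> list[int]:
    """Illustrative batch schedule: batch i ends at roughly K^(1 - 2^{-i}).

    With M = ceil(log2(log2(K))) + 1 batches (the last forced to end at K),
    this grid covers all K episodes, showing why O(log2 log2 K) batches
    suffice in the doubling-style analyses of batched learning.
    NOTE: a generic sketch, not the schedule from the paper.
    """
    M = math.ceil(math.log2(math.log2(K))) + 1
    ends = [min(K, math.ceil(K ** (1.0 - 2.0 ** (-i)))) for i in range(1, M)]
    ends.append(K)  # the final batch always runs to episode K
    # Deduplicate while preserving the increasing order.
    out: list[int] = []
    for t in ends:
        if not out or t > out[-1]:
            out.append(t)
    return out

# Example: K = 10^6 episodes needs only ~6 batches under this grid.
print(batch_endpoints(10**6))
```

For $K = 10^6$, $\lceil\log_2\log_2 K\rceil + 1 = 6$, so the schedule has six endpoints growing roughly as $K^{1/2}, K^{3/4}, K^{7/8}, \ldots, K$.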
