
Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation

Abstract

We study reinforcement learning with linear function approximation, where the transition probability and reward functions are linear with respect to a feature mapping $\boldsymbol{\phi}(s,a)$. Specifically, we consider the episodic inhomogeneous linear Markov decision process (MDP) and propose a novel, computationally efficient algorithm, LSVI-UCB$^+$, which achieves an $\widetilde{O}(Hd\sqrt{T})$ regret bound, where $H$ is the episode length, $d$ is the feature dimension, and $T$ is the number of steps. LSVI-UCB$^+$ builds on weighted ridge regression and upper-confidence value iteration with a Bernstein-type exploration bonus. Our statistical results are obtained with novel analytical tools, including a new Bernstein self-normalized bound with conservatism on elliptical potentials and a refined analysis of the correction term. The algorithm is minimax optimal for linear MDPs up to logarithmic factors, closing the $\sqrt{Hd}$ gap between the upper bound of $\widetilde{O}(\sqrt{H^3d^3T})$ of (Jin et al., 2020) and the lower bound of $\Omega(Hd\sqrt{T})$ for linear MDPs.
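
For intuition, the weighted ridge regression step at the core of LSVI-UCB$^+$ can be sketched schematically as follows; the notation here (in particular the per-sample variance estimates $\bar\sigma_{h,i}$ and the regularizer $\lambda$) is illustrative and may differ from the paper's exact formulation:
$$
\widehat{\mathbf{w}}_h^k \;\leftarrow\; \operatorname*{argmin}_{\mathbf{w} \in \mathbb{R}^d} \;\sum_{i=1}^{k-1} \frac{\big(\langle \boldsymbol{\phi}(s_h^i,a_h^i), \mathbf{w}\rangle - \widehat{V}_{h+1}^k(s_{h+1}^i)\big)^2}{\bar\sigma_{h,i}^2} \;+\; \lambda \|\mathbf{w}\|_2^2 .
$$
Each past transition is down-weighted by an estimate $\bar\sigma_{h,i}^2$ of the variance of its regression target, in contrast to the unweighted ridge regression used in LSVI-UCB, and the resulting estimate is combined with a Bernstein-type exploration bonus in the value iteration step.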
