
Variance-Aware Confidence Set: Variance-Dependent Bound for Linear Bandits and Horizon-Free Bound for Linear Mixture MDP

Neural Information Processing Systems (NeurIPS), 2021
Abstract

We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). Our method yields the following new regret bounds:

* For linear bandits, we obtain an $\widetilde{O}(\mathrm{poly}(d)\sqrt{1 + \sum_{i=1}^{K}\sigma_i^2})$ regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_i^2$ is the (unknown) variance of the reward at the $i$-th round. This is the first regret bound that scales only with the variance and the dimension, with no explicit polynomial dependency on $K$.
* For linear mixture MDPs, we obtain an $\widetilde{O}(\mathrm{poly}(d, \log H)\sqrt{K})$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that scales only logarithmically with $H$ in the reinforcement learning with linear function approximation setting, thus exponentially improving existing results.

Our methods rely on three novel ideas that may be of independent interest: 1) applications of the peeling technique to the norm of the input and the magnitude of the variance, 2) a recursion-based approach to estimating the variance, and 3) a convex potential lemma that somewhat generalizes the seminal elliptical potential lemma.
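To make the flavor of a variance-aware confidence set concrete, below is a minimal sketch of a variance-weighted ridge regression (LinUCB-style) update. It is an illustration only, not the paper's construction: it assumes the per-round variance proxies `sigma2` are handed to the learner and that the confidence radius `beta` is set externally, whereas the paper estimates the variance recursively and refines the confidence set via peeling over the input norm and the variance magnitude. The class name `WeightedLinUCB` and all parameter names are hypothetical.

```python
# Hedged sketch: variance-weighted ridge regression with an optimistic
# (UCB) arm selection rule. This only illustrates the generic technique
# that variance-aware confidence sets build on; the paper's actual
# construction and confidence radius differ.
import numpy as np

class WeightedLinUCB:
    def __init__(self, d, lam=1.0, beta=1.0):
        self.beta = beta           # confidence radius (set by theory in practice)
        self.A = lam * np.eye(d)   # weighted Gram matrix: lam*I + sum_i x_i x_i^T / sigma_i^2
        self.b = np.zeros(d)       # weighted responses:   sum_i x_i y_i / sigma_i^2

    def update(self, x, y, sigma2):
        """Incorporate one round: feature x, reward y, variance proxy sigma2."""
        w = 1.0 / max(sigma2, 1e-8)    # down-weight high-variance rounds
        self.A += w * np.outer(x, x)
        self.b += w * x * y

    def select(self, arms):
        """Pick the arm maximizing the optimistic value x^T theta_hat + beta * ||x||_{A^{-1}}."""
        theta_hat = np.linalg.solve(self.A, self.b)
        A_inv = np.linalg.inv(self.A)
        ucb = [x @ theta_hat + self.beta * np.sqrt(x @ A_inv @ x) for x in arms]
        return int(np.argmax(ucb))
```

Down-weighting each round by its (estimated) variance is what lets the resulting regret depend on $\sum_{i=1}^{K}\sigma_i^2$ rather than on $K$ directly.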
