Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs). Our method yields the following new regret bounds:

* For linear bandits, we obtain an $\widetilde{O}\big(\mathrm{poly}(d)\sqrt{1+\sum_{k=1}^{K}\sigma_k^2}\big)$ regret bound, where $d$ is the feature dimension, $K$ is the number of rounds, and $\sigma_k^2$ is the (unknown) variance of the reward at the $k$-th round. This is the first regret bound that scales only with the variance and the dimension, with no explicit polynomial dependency on $K$.
* For linear mixture MDPs, we obtain an $\widetilde{O}\big(\mathrm{poly}(d,\log H)\sqrt{K}\big)$ regret bound, where $d$ is the number of base models, $K$ is the number of episodes, and $H$ is the planning horizon. This is the first regret bound that scales only logarithmically with $H$ in the reinforcement-learning-with-linear-function-approximation setting, thus exponentially improving existing results.

Our method utilizes three novel ideas that may be of independent interest: 1) applying peeling techniques to both the norm of the input and the magnitude of the variance, 2) a recursion-based approach to estimating the variance, and 3) a convex potential lemma that somewhat generalizes the seminal elliptical potential lemma.
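For context, a sketch of the elliptical potential lemma that the abstract's convex potential lemma generalizes. The statement below is the standard form from the linear bandit literature, not taken from this abstract, so the exact constants and conditions are an assumption:

```latex
% Standard elliptical potential lemma (assumed form, for context;
% the paper's convex potential lemma generalizes a statement of this type).
% Let $x_1,\dots,x_K \in \mathbb{R}^d$ with $\|x_k\|_2 \le L$, and let
% $\Lambda_k = \lambda I + \sum_{i=1}^{k} x_i x_i^\top$. Then
\[
  \sum_{k=1}^{K} \min\!\left\{1,\ \|x_k\|_{\Lambda_{k-1}^{-1}}^{2}\right\}
  \;\le\; 2d \log\!\left(1 + \frac{K L^{2}}{d\lambda}\right).
\]
```

Bounds of this shape are what let the sum of per-round uncertainty terms grow only logarithmically in $K$, which is the mechanism behind $\sqrt{K}$-type regret bounds.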