
Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Main: 10 pages
3 figures
1 table
Bibliography: 5 pages
Appendix: 20 pages
Abstract

Although many popular reinforcement learning algorithms are underpinned by $f$-divergence regularization, their sample complexity with respect to the \emph{regularized objective} still lacks a tight characterization. In this paper, we analyze $f$-divergence-regularized offline policy learning. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we give the first $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for contextual bandits, improving on the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. Our analysis for general function approximation leverages the principle of pessimism in the face of uncertainty to refine a mean-value-type argument to its extreme. This in turn leads to a novel moment-based technique, effectively bypassing the need for uniform control over the discrepancy between any two functions in the function class. We further establish a lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the strong convexity of reverse KL. In addition, for $f$-divergences with strongly convex $f$, to which reverse KL \emph{does not} belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without single-policy concentrability; in this case, the algorithm design can dispense with pessimistic estimators. We further extend our analysis to dueling bandits, and we believe these results take a significant step toward a comprehensive understanding of $f$-divergence-regularized policy learning.
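As a rough sketch of the setting (the context distribution $\rho$, reference/behavior policy $\mu$, reward $r$, and regularization strength $\beta$ below are placeholder notation, not taken from the abstract), the learner maximizes a divergence-penalized value:

$$
J_\beta(\pi) \;=\; \mathbb{E}_{x \sim \rho,\; a \sim \pi(\cdot\mid x)}\!\big[r(x,a)\big] \;-\; \beta\,\mathbb{E}_{x \sim \rho}\!\left[ D_f\!\big(\pi(\cdot\mid x)\,\big\|\,\mu(\cdot\mid x)\big) \right],
$$

with the reverse-KL case corresponding to a $D_{\mathrm{KL}}\big(\pi(\cdot\mid x)\,\|\,\mu(\cdot\mid x)\big)$ penalty, the regularizer standard in KL-regularized policy optimization; sample complexity is then measured with respect to the maximizer of this regularized objective $J_\beta$ rather than the unregularized value.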
