
Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Main: 10 pages, 3 figures, 1 table; Bibliography: 5 pages; Appendix: 20 pages
Abstract

Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we achieve an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing the existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and the $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for $f$-divergences with strongly convex $f$, to which reverse KL *does not* belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with $f$-divergence regularization.
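
For readers less familiar with the setting, the following is a minimal sketch of the kind of regularized objective this line of work studies. The notation ($\rho$, $\pi_{\mathrm{ref}}$, $\beta$) is generic and assumed here for illustration; the paper's exact formulation may differ.

```latex
% A generic f-divergence-regularized contextual bandit objective
% (illustrative notation; not necessarily the paper's exact formulation).
%   \rho: context distribution, r(x,a): reward,
%   \pi_{\mathrm{ref}}: reference policy, \beta > 0: regularization strength,
%   f: convex function with f(1) = 0, inducing the f-divergence D_f.
\[
J_\beta(\pi) = \mathbb{E}_{x \sim \rho}\Big[
    \mathbb{E}_{a \sim \pi(\cdot \mid x)}\big[r(x,a)\big]
    - \beta\, D_f\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
  \Big],
\qquad
D_f(P \,\|\, Q) = \mathbb{E}_{a \sim Q}\Big[ f\Big(\tfrac{P(a)}{Q(a)}\Big) \Big].
\]
```

Different choices of $f$ recover the divergences discussed in the abstract (e.g., KL-type regularizers versus divergences induced by strongly convex $f$), which is what drives the differing concentrability requirements.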
