
Regret Bounds for Adversarial Contextual Bandits with General Function Approximation and Delayed Feedback

Main: 10 pages · Bibliography: 3 pages · Appendix: 12 pages
Abstract

We present regret minimization algorithms for the contextual multi-armed bandit (CMAB) problem over $K$ actions in the presence of delayed feedback, a scenario where loss observations arrive with delays chosen by an adversary. As a preliminary result, assuming direct access to a finite policy class $\Pi$, we establish an optimal expected regret bound of $O(\sqrt{KT \log |\Pi|} + \sqrt{D \log |\Pi|})$, where $D$ is the sum of delays. For our main contribution, we study the general function approximation setting over a (possibly infinite) contextual loss function class $\mathcal{F}$ with access to an online least-squares regression oracle $\mathcal{O}$ over $\mathcal{F}$. In this setting, we achieve an expected regret bound of $O(\sqrt{KT\,\mathcal{R}_T(\mathcal{O})} + \sqrt{d_{\max} D \beta})$ assuming FIFO order, where $d_{\max}$ is the maximal delay, $\mathcal{R}_T(\mathcal{O})$ is an upper bound on the oracle's regret, and $\beta$ is a stability parameter associated with the oracle. We complement this general result with a novel stability analysis of a Hedge-based version of Vovk's aggregating forecaster as an oracle implementation for least-squares regression over a finite function class $\mathcal{F}$, and show that its stability parameter $\beta$ is bounded by $\log |\mathcal{F}|$. This yields an expected regret bound of $O(\sqrt{KT \log |\mathcal{F}|} + \sqrt{d_{\max} D \log |\mathcal{F}|})$, which is a $\sqrt{d_{\max}}$ factor away from the lower bound of $\Omega(\sqrt{KT \log |\mathcal{F}|} + \sqrt{D \log |\mathcal{F}|})$ that we also present.
