No-Regret Linear Bandits beyond Realizability

Abstract

We study linear bandits when the underlying reward function is not linear. Existing work relies on a uniform misspecification parameter ε that measures the sup-norm error of the best linear approximation. This results in an unavoidable linear regret whenever ε > 0. We describe a more natural model of misspecification which only requires the approximation error at each input x to be proportional to the suboptimality gap at x. It captures the intuition that, for optimization problems, near-optimal regions matter more, so larger approximation errors can be tolerated in suboptimal regions. Quite surprisingly, we show that the classical LinUCB algorithm, designed for the realizable case, is automatically robust against such gap-adjusted misspecification. It achieves a near-optimal √T regret on problems for which the best previously known regret bound is almost linear in the time horizon T. Technically, our proof relies on a novel self-bounding argument that bounds the part of the regret due to misspecification by the regret itself.
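As an illustration of the gap-adjusted condition, one natural formalization (the notation f, θ*, x*, and ρ below is our own paraphrase, not necessarily the paper's) is: writing f for the true, possibly nonlinear, reward, θ* for the best linear approximation, x* for an optimal arm, and ρ ∈ [0, 1) for a constant,

  |f(x) − ⟨θ*, x⟩| ≤ ρ · (f(x*) − f(x))   for every arm x.

By contrast, the uniform model requires |f(x) − ⟨θ*, x⟩| ≤ ε for every x, forcing the linear fit to be accurate even on arms that are far from optimal.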
