Fast and Regret Optimal Best Arm Identification: Fundamental Limits and Low-Complexity Algorithms

Abstract

This paper considers a stochastic multi-armed bandit (MAB) problem with dual objectives: (i) quick identification and commitment to the optimal arm, and (ii) reward maximization throughout a sequence of $T$ consecutive rounds. Although each objective has been individually well studied, i.e., best arm identification for (i) and regret minimization for (ii), the simultaneous realization of both objectives remains an open problem, despite its practical importance. This paper introduces \emph{Regret Optimal Best Arm Identification} (ROBAI), which aims to achieve these dual objectives. To solve ROBAI under both pre-determined and adaptive stopping time requirements, we present the $\mathsf{EOCP}$ algorithm and its variants, respectively, which not only achieve asymptotically optimal regret in both Gaussian and general bandits, but also commit to the optimal arm in $\mathcal{O}(\log T)$ rounds with a pre-determined stopping time and $\mathcal{O}(\log^2 T)$ rounds with an adaptive stopping time. We further characterize lower bounds on the commitment time (equivalent to sample complexity) of ROBAI, showing that $\mathsf{EOCP}$ and its variants are sample optimal with a pre-determined stopping time and almost sample optimal with an adaptive stopping time. Numerical results confirm our theoretical analysis and reveal an interesting ``over-exploration'' phenomenon exhibited by classic $\mathsf{UCB}$ algorithms: $\mathsf{EOCP}$ incurs smaller regret even though it stops exploration much earlier than $\mathsf{UCB}$ ($\mathcal{O}(\log T)$ versus $\mathcal{O}(T)$ rounds), which suggests that over-exploration is unnecessary and potentially harmful to system performance.
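To make the explore-then-commit structure described above concrete, below is a minimal Python sketch of a generic explore-then-commit procedure in the spirit of $\mathsf{EOCP}$ with a pre-determined stopping time: explore all arms uniformly for $\mathcal{O}(\log T)$ rounds, then commit to the empirically best arm. The constant `c`, the Gaussian reward model, and the exact stopping rule are illustrative assumptions, not the paper's precise specification.

```python
import numpy as np

def explore_then_commit(means, T, sigma=1.0, seed=0):
    """Explore-then-commit sketch (illustrative, not the exact EOCP rule).

    Pulls each arm round-robin for an exploration phase of O(log T) rounds,
    then commits to the empirically best arm for the remaining rounds.
    Returns the committed arm and the cumulative pseudo-regret.
    """
    rng = np.random.default_rng(seed)
    K = len(means)
    best_mean = max(means)

    # Pre-determined stopping time: roughly c * log(T) pulls per arm.
    # The constant c is a hypothetical choice; the paper derives the
    # correct scaling for its optimality guarantees.
    c = 4.0 * sigma**2
    n_explore = int(np.ceil(c * np.log(T)))

    counts = np.zeros(K)
    sums = np.zeros(K)
    regret = 0.0

    # Exploration phase: uniform round-robin sampling of all arms.
    t = 0
    while t < K * n_explore and t < T:
        a = t % K
        reward = rng.normal(means[a], sigma)  # assumed Gaussian rewards
        counts[a] += 1
        sums[a] += reward
        regret += best_mean - means[a]
        t += 1

    # Commitment phase: play the empirical best arm for all remaining rounds.
    a_hat = int(np.argmax(sums / np.maximum(counts, 1)))
    regret += (T - t) * (best_mean - means[a_hat])
    return a_hat, regret

# Example: commitment happens after O(log T) rounds, so cumulative regret
# stays O(log T) whenever the committed arm is the true best arm.
arm, reg = explore_then_commit(means=[0.5, 0.4, 0.3], T=100_000)
print(f"committed arm: {arm}, cumulative regret: {reg:.1f}")
```

In contrast, a $\mathsf{UCB}$-style algorithm never commits and keeps sampling suboptimal arms throughout all $T$ rounds, which is the ``over-exploration'' behavior the abstract refers to.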
