
Learning to Use Learners' Advice

Abstract

In this paper, we study a variant of the framework of online learning using expert advice with limited/bandit feedback, in which each expert is itself a learning entity; this captures scenarios that are more realistic and practical for real-world applications. In our setting, the feedback at any time $t$ is limited in the sense that it is only available to the expert $i^t$ selected by the central algorithm (forecaster), i.e., only expert $i^t$ receives feedback from the environment and gets to learn at time $t$. We consider a generic black-box approach whereby the forecaster neither controls nor knows the learning dynamics of the experts, apart from the following no-regret learning property: the average regret of any expert $j$ vanishes at a rate of at least $O(t_j^{\beta-1})$ after $t_j$ learning steps, where $\beta \in [0, 1]$ is a parameter. We prove the following hardness result: without any coordination between the forecaster and the experts, it is impossible to design a forecaster achieving no-regret guarantees in the worst case. To circumvent this hardness result, we consider a practical assumption allowing the forecaster to "guide" the learning process of the experts by filtering/blocking some of the feedback they observe from the environment, i.e., not allowing the selected expert $i^t$ to learn at time $t$ for some time steps. We then design a novel no-regret learning algorithm \algo for this problem setting by carefully guiding the feedback observed by the experts. We prove that \algo achieves a worst-case expected cumulative regret of $O(T^{\frac{1}{2-\beta}})$ after $T$ time steps, matching the regret bound of $\Theta(T^{\frac{1}{2}})$ for the special case of multi-armed bandits.
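As a concrete illustration of the interaction protocol described in the abstract (and not of the paper's \algo algorithm, which is not specified here), the following Python sketch simulates a forecaster that selects one learner-expert per round, forwards the environment's feedback only to that expert, and may block it. The loss model, the exponential-weights updates for both the forecaster and the experts, and the blocking rule are all illustrative assumptions.

```python
import numpy as np

# Minimal simulation of the interaction protocol (illustrative only):
# a forecaster picks one learner-expert per round, only that expert can
# receive the environment's feedback, and the forecaster may block it.
# Loss model, learning rates, and the blocking rule are assumptions.

rng = np.random.default_rng(0)

T, K, A = 1000, 3, 5                  # horizon, #experts, #actions per expert (assumed)
gamma = 0.1                           # forecaster's exploration rate (assumed)
eta_f = np.sqrt(np.log(K) / (K * T))  # forecaster step size (EXP3-style)
eta_e = 0.1                           # experts' step size (assumed)

w_f = np.ones(K)                      # forecaster's weights over experts
w_e = np.ones((K, A))                 # each expert's weights over its own actions

for t in range(1, T + 1):
    # Forecaster samples expert i^t (EXP3-style distribution with exploration).
    p = (1 - gamma) * w_f / w_f.sum() + gamma / K
    i_t = rng.choice(K, p=p)

    # The selected expert proposes an action from its current distribution.
    q = w_e[i_t] / w_e[i_t].sum()
    a_t = rng.choice(A, p=q)

    # Environment returns losses in [0, 1] (synthetic for this sketch).
    losses = rng.uniform(size=A)
    loss = losses[a_t]

    # Forecaster decides whether to block feedback this round
    # (placeholder rule; the paper's guiding strategy is more careful).
    block = (t % 4 == 0)
    if not block:
        # Only expert i^t learns. For simplicity we let it observe the full
        # loss vector and run exponential weights; its internal dynamics are
        # black-box in the paper's setting.
        w_e[i_t] *= np.exp(-eta_e * losses)
        w_e[i_t] /= w_e[i_t].sum()

    # Forecaster's bandit-style importance-weighted update over experts.
    est = np.zeros(K)
    est[i_t] = loss / p[i_t]
    w_f *= np.exp(-eta_f * est)
    w_f /= w_f.sum()
```

Note that the exponent $\frac{1}{2-\beta}$ in the stated bound interpolates between $\frac{1}{2}$ at $\beta = 0$ and $1$ at $\beta = 1$.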
