
Learning to Use Learners' Advice

Abstract

In this paper, we study a variant of the framework of online learning using expert advice with limited/bandit feedback, in which each expert is itself a learning entity; this captures scenarios that are more realistic and practical for real-world applications. In our setting, the feedback at any time $t$ is limited in the sense that it is only available to the expert $i^t$ selected by the central algorithm (forecaster), i.e., only expert $i^t$ receives feedback from the environment and gets to learn at time $t$. We consider a generic black-box approach whereby the forecaster neither controls nor knows the learning dynamics of the experts, apart from the following no-regret learning property: the average regret of any expert $j$ vanishes at a rate of at least $O(t_j^{\beta-1})$ after $t_j$ learning steps, where $\beta \in [0, 1]$ is a parameter. We prove the following hardness result: without any coordination between the forecaster and the experts, it is impossible to design a forecaster achieving no-regret guarantees in the worst case. To circumvent this hardness result, we consider a practical assumption allowing the forecaster to "guide" the learning process of the experts by filtering/blocking some of the feedback they observe from the environment, i.e., not allowing the selected expert $i^t$ to learn at time $t$ for some time steps. We then design a novel no-regret learning algorithm \algo for this problem setting by carefully guiding the feedback observed by the experts. We prove that \algo achieves a worst-case expected cumulative regret of $O(T^{\frac{1}{2-\beta}})$ after $T$ time steps, matching the regret bound of $\Theta(T^{\frac{1}{2}})$ for the special case of multi-armed bandits.
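As a concrete illustration of the interaction protocol described in the abstract (and not of the paper's \algo algorithm, which is not specified here), the following Python sketch simulates a forecaster that selects one learner-expert per round, forwards the environment's feedback only to that expert, and may block it. The loss model, the exponential-weights updates for both the forecaster and the experts, and the blocking rule are all illustrative assumptions.

```python
import numpy as np

# Minimal simulation of the interaction protocol (illustrative only):
# a forecaster picks one learner-expert per round, only that expert can
# receive the environment's feedback, and the forecaster may block it.
# Loss model, learning rates, and the blocking rule are assumptions.

rng = np.random.default_rng(0)

T, K, A = 1000, 3, 5                  # horizon, #experts, #actions per expert (assumed)
gamma = 0.1                           # forecaster's exploration rate (assumed)
eta_f = np.sqrt(np.log(K) / (K * T))  # forecaster step size (EXP3-style)
eta_e = 0.1                           # experts' step size (assumed)

w_f = np.ones(K)                      # forecaster's weights over experts
w_e = np.ones((K, A))                 # each expert's weights over its own actions

for t in range(1, T + 1):
    # Forecaster samples expert i^t (EXP3-style distribution with exploration).
    p = (1 - gamma) * w_f / w_f.sum() + gamma / K
    i_t = rng.choice(K, p=p)

    # The selected expert proposes an action from its current distribution.
    q = w_e[i_t] / w_e[i_t].sum()
    a_t = rng.choice(A, p=q)

    # Environment returns losses in [0, 1] (synthetic for this sketch).
    losses = rng.uniform(size=A)
    loss = losses[a_t]

    # Forecaster decides whether to block feedback this round
    # (placeholder rule; the paper's guiding strategy is more careful).
    block = (t % 4 == 0)
    if not block:
        # Only expert i^t learns. For simplicity we let it observe the full
        # loss vector and run exponential weights; its internal dynamics are
        # black-box in the paper's setting.
        w_e[i_t] *= np.exp(-eta_e * losses)
        w_e[i_t] /= w_e[i_t].sum()

    # Forecaster's bandit-style importance-weighted update over experts.
    est = np.zeros(K)
    est[i_t] = loss / p[i_t]
    w_f *= np.exp(-eta_f * est)
    w_f /= w_f.sum()
```

Note that the exponent $\frac{1}{2-\beta}$ in the stated bound interpolates between $\frac{1}{2}$ at $\beta = 0$ and $1$ at $\beta = 1$.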
