An Unbiased, Data-Driven, Offline Evaluation Method of Contextual Bandit Algorithms
- OffRL
Offline evaluation of reinforcement learning algorithms based on collected data (state transitions and rewards) remains a challenging problem. Common practice is to build a simulator from the collected data and then run the algorithm against this simulator. Building such a simulator is often difficult and may introduce bias into the evaluation results. In this paper, we introduce an offline evaluation method for a subclass of reinforcement learning problems known as contextual bandits. This method is entirely data-driven, requires no simulator, and gives provably unbiased evaluation results. Its effectiveness is also validated empirically on a large-scale news article recommendation dataset collected from the Yahoo! Front Page.
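The abstract does not spell out the estimator, but evaluation methods of this kind are usually described as replay (rejection-sampling) evaluators: logged events are replayed, and an event counts toward the estimate only when the candidate policy picks the same arm the logging policy actually pulled. The sketch below illustrates that idea under the assumption that the log was collected by a uniformly random policy; the names `LoggedEvent` and `replay_evaluate` are illustrative, not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class LoggedEvent:
    """One logged interaction: context, the arm the logging policy pulled, and the observed reward."""
    context: dict
    arms: Sequence[str]   # candidate arms available at this event
    chosen_arm: str       # arm selected by the (assumed uniformly random) logging policy
    reward: float         # reward observed for the chosen arm only


def replay_evaluate(policy: Callable[[dict, Sequence[str]], str],
                    log: List[LoggedEvent]) -> float:
    """Replay-style offline evaluation (illustrative sketch).

    Walk through the logged events; whenever the candidate policy chooses
    the same arm the logging policy actually pulled, the logged reward is a
    valid sample and the event is kept, otherwise the event is skipped.
    Under a uniformly random logging policy, the average kept reward is an
    unbiased estimate of the candidate policy's per-step reward.
    """
    total_reward = 0.0
    matched = 0
    for event in log:
        if policy(event.context, event.arms) == event.chosen_arm:
            total_reward += event.reward
            matched += 1
    return total_reward / matched if matched else 0.0


# Toy usage: evaluate a trivial policy on synthetic, uniformly logged data.
if __name__ == "__main__":
    arms = ["article_a", "article_b"]
    rng = random.Random(0)
    log = [LoggedEvent(context={"hour": rng.randint(0, 23)},
                       arms=arms,
                       chosen_arm=rng.choice(arms),
                       reward=float(rng.random() < 0.05))
           for _ in range(10_000)]
    always_a = lambda ctx, candidate_arms: "article_a"
    print("Estimated CTR of 'always_a':", replay_evaluate(always_a, log))
```

Because mismatched events are discarded, only a fraction of the log contributes to the estimate (roughly 1/K of it with K arms under uniform logging), which is the usual data-efficiency trade-off of replay-style evaluation.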