An Unbiased, Data-Driven, Offline Evaluation Method of Contextual Bandit Algorithms
- OffRL
Offline evaluation of reinforcement learning algorithms based on collected data (state transitions and rewards) remains a challenging problem. Common practice is to build a simulator from the collected data and then run the algorithm against this simulator. Building such a simulator is often difficult and may introduce bias into the evaluation results. In this paper, we introduce an offline evaluation method for a subclass of reinforcement learning problems known as contextual bandits. This method is entirely data-driven, requires no simulator, and gives provably unbiased evaluation results. Its effectiveness is also validated empirically on a large-scale news article recommendation dataset collected from the Yahoo! Front Page.
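The abstract does not spell out the estimator, but evaluation methods of this kind are usually described as replay (rejection-sampling) evaluators: logged events are replayed, and an event counts toward the estimate only when the candidate policy picks the same arm the logging policy actually pulled. The sketch below illustrates that idea under the assumption that the log was collected by a uniformly random policy; the names `LoggedEvent` and `replay_evaluate` are illustrative, not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class LoggedEvent:
    """One logged interaction: context, the arm the logging policy pulled, and the observed reward."""
    context: dict
    arms: Sequence[str]   # candidate arms available at this event
    chosen_arm: str       # arm selected by the (assumed uniformly random) logging policy
    reward: float         # reward observed for the chosen arm only


def replay_evaluate(policy: Callable[[dict, Sequence[str]], str],
                    log: List[LoggedEvent]) -> float:
    """Replay-style offline evaluation (illustrative sketch).

    Walk through the logged events; whenever the candidate policy chooses
    the same arm the logging policy actually pulled, the logged reward is a
    valid sample and the event is kept, otherwise the event is skipped.
    Under a uniformly random logging policy, the average kept reward is an
    unbiased estimate of the candidate policy's per-step reward.
    """
    total_reward = 0.0
    matched = 0
    for event in log:
        if policy(event.context, event.arms) == event.chosen_arm:
            total_reward += event.reward
            matched += 1
    return total_reward / matched if matched else 0.0


# Toy usage: evaluate a trivial policy on synthetic, uniformly logged data.
if __name__ == "__main__":
    arms = ["article_a", "article_b"]
    rng = random.Random(0)
    log = [LoggedEvent(context={"hour": rng.randint(0, 23)},
                       arms=arms,
                       chosen_arm=rng.choice(arms),
                       reward=float(rng.random() < 0.05))
           for _ in range(10_000)]
    always_a = lambda ctx, candidate_arms: "article_a"
    print("Estimated CTR of 'always_a':", replay_evaluate(always_a, log))
```

Because mismatched events are discarded, only a fraction of the log contributes to the estimate (roughly 1/K of it with K arms under uniform logging), which is the usual data-efficiency trade-off of replay-style evaluation.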