Minimax Confidence Interval for Off-Policy Evaluation and Policy Optimization
- OffRL
We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods promise to overcome the exponential variance of traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For trustworthy OPE, can these biases be quantified? (2) They are split into two styles ("weight-learning" vs. "value-learning"). Can the two be unified? In this paper we answer both questions positively. By slightly altering the derivation of previous methods (one from each style; Uehara et al., 2019), we unify them into a single confidence interval (CI) that comes with a special type of double robustness: when either the value-function or the importance-weight class is well specified, the CI is valid and its length quantifies the misspecification of the other class. Our CI also provides a unified view of and new insights into several recent methods, and we further explore the implications of our results for exploration and exploitation in off-policy policy optimization with insufficient data coverage.
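To make the construction concrete, below is a minimal sketch of how such an interval can be formed in the simplest finite-class, tabular setting, using the standard doubly robust Lagrangian from the minimax OPE literature (e.g., Uehara et al., 2019): L(w, q) = (1 − γ) E_{s0∼d0}[q(s0, π)] + E_data[w(s,a)(r + γ q(s′, π) − q(s,a))]. When the value class contains the true q^π, the residual term vanishes in expectation for every w, so [max_w min_q L, min_w max_q L] brackets J(π). All names, the array layout, and the enumeration over a finite list of candidates are illustrative assumptions; a real instantiation would optimize over richer function classes, and the paper's exact CI construction may differ in detail.

```python
import numpy as np

def lagrangian(w, q, data, init_dist, pi, gamma):
    """Empirical doubly robust Lagrangian L(w, q) (names are illustrative).

    w, q      : (S, A) arrays -- candidate importance weights / value function
    data      : dict of 1-D arrays "s", "a", "r", "s_next" (logged transitions)
    init_dist : (S,) array -- initial-state distribution d0
    pi        : (S, A) array -- target policy, pi[s, a] = pi(a | s)
    """
    s, a, r, s_next = data["s"], data["a"], data["r"], data["s_next"]
    v_next = (pi[s_next] * q[s_next]).sum(axis=1)    # E_{a'~pi}[q(s', a')]
    residual = r + gamma * v_next - q[s, a]          # Bellman residual of q
    init_term = (init_dist[:, None] * pi * q).sum()  # E_{s0~d0}[q(s0, pi)]
    return (1.0 - gamma) * init_term + np.mean(w[s, a] * residual)

def value_interval(W, Q, data, init_dist, pi, gamma):
    """Interval over finite candidate lists W (weights) and Q (values).

    If Q contains the true q^pi, then L(w, q^pi) = J(pi) for every w, so
    min_q L(w, .) <= J(pi) <= max_q L(w, .) holds for each w; the tightest
    such bounds give the interval below.  The symmetric interval
    [max_q min_w L, min_q max_w L] is valid when W is well specified instead.
    """
    L = np.array([[lagrangian(w, q, data, init_dist, pi, gamma)
                   for q in Q] for w in W])          # shape (|W|, |Q|)
    lower = L.min(axis=1).max()                      # max_w min_q L(w, q)
    upper = L.max(axis=1).min()                      # min_w max_q L(w, q)
    return lower, upper
```

One property worth noting in this toy construction: enlarging the weight list W can only tighten the interval, while validity rests on Q containing the truth. This mirrors the double-robustness statement in the abstract, where the interval's length quantifies the misspecification of the class that is not assumed correct.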
View on arXiv