
Worst-Case Analysis for Randomly Collected Data

Abstract

We introduce a framework for statistical estimation that leverages knowledge of how samples are collected but makes no distributional assumptions on the data values. Specifically, we consider a population of elements $[n]=\{1,\ldots,n\}$ with corresponding data values $x_1,\ldots,x_n$. We observe the values for a "sample" set $A \subset [n]$ and wish to estimate some statistic of the values for a "target" set $B \subset [n]$, where $B$ could be the entire set. Crucially, we assume that the sets $A$ and $B$ are drawn according to some known distribution $P$ over pairs of subsets of $[n]$. A given estimation algorithm is evaluated based on its "worst-case, expected error", where the expectation is with respect to the distribution $P$ from which the sample set $A$ and target set $B$ are drawn, and the worst case is with respect to the data values $x_1,\ldots,x_n$. Within this framework, we give an efficient algorithm for estimating the target mean that returns a weighted combination of the sample values, where the weights are functions of the distribution $P$ and the sets $A$ and $B$, and we show that the worst-case expected error achieved by this algorithm is at most a multiplicative $\pi/2$ factor worse than that of the optimal such algorithm. The algorithm and proof leverage a surprising connection to the Grothendieck problem. This framework, which makes no distributional assumptions on the data values but rather relies on knowledge of the data collection process, is a significant departure from typical estimation and provides a uniform algorithmic analysis for the many natural settings where membership in a sample may be correlated with data values, such as when sampling probabilities vary as in "importance sampling", when individuals are recruited into a sample via a social network as in "snowball sampling", or when samples have chronological structure as in "selective prediction".
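To make the evaluation criterion concrete, the following toy Python sketch (not the paper's algorithm) computes the worst-case expected squared error of the plain sample mean under a small, hypothetical sampling distribution $P$. All names, the two-outcome distribution, and the assumption that data values lie in $[-1,1]^n$ are illustrative; since the expected squared error of a linear estimator is a convex quadratic in the data values, its maximum over the box is attained at a vertex, which the sketch brute-forces.

```python
import itertools
import numpy as np

n = 4  # population size (illustrative)

# Hypothetical distribution P: list of (probability, sample set A, target set B).
# Here membership in A is correlated with the index, mimicking chronological structure.
P = [
    (0.5, {0, 1}, {0, 1, 2, 3}),
    (0.5, {2, 3}, {0, 1, 2, 3}),
]

def estimate(x, A):
    """Plain sample mean: a weighted combination with uniform weights on A."""
    return np.mean([x[i] for i in A])

def target_mean(x, B):
    return np.mean([x[i] for i in B])

def expected_sq_error(x):
    """Expectation over (A, B) ~ P of the squared estimation error."""
    return sum(p * (estimate(x, A) - target_mean(x, B)) ** 2 for p, A, B in P)

# The expected squared error is a convex quadratic in x, so its maximum over
# [-1, 1]^n is attained at a vertex; enumerate the vertices (feasible for small n).
worst = max(
    expected_sq_error(np.array(v))
    for v in itertools.product([-1.0, 1.0], repeat=n)
)
print(f"worst-case expected squared error of the sample mean: {worst:.4f}")
```

The paper's estimator instead chooses weights as a function of $P$, $A$, and $B$ to minimize this worst-case quantity; the sketch only illustrates the objective being optimized.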
