Learning Populations of Parameters

Abstract

Consider the following estimation problem: there are $n$ entities, each with an unknown parameter $p_i \in [0,1]$, and we observe $n$ independent random variables, $X_1,\ldots,X_n$, with $X_i \sim \mathrm{Binomial}(t, p_i)$. How accurately can one recover the "histogram" (i.e. cumulative distribution function) of the $p_i$'s? While the empirical estimates would recover the histogram to earth mover distance $\Theta(\frac{1}{\sqrt{t}})$ (equivalently, $\ell_1$ distance between the CDFs), we show that, provided $n$ is sufficiently large, we can achieve error $O(\frac{1}{t})$, which is information theoretically optimal. We also extend our results to the multi-dimensional parameter case, capturing settings where each member of the population has multiple associated parameters. Beyond the theoretical results, we demonstrate that the recovery algorithm performs well in practice on a variety of datasets, providing illuminating insights into several domains, including politics, sports analytics, and variation in the gender ratio of offspring.
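
To make the setup concrete, the following is a minimal simulation sketch of the empirical-estimate baseline only; it does not implement the recovery algorithm of the paper. The function names (`simulate_counts`, `emd_equal_size`, `empirical_emd`) and the choice of a point-mass population at $p_i = 0.5$ are illustrative assumptions, not taken from the source. It relies on the standard fact that, for two equal-size samples on the line, the earth mover distance equals the mean absolute difference of the sorted values.

```python
import random


def simulate_counts(ps, t, rng):
    """Draw X_i ~ Binomial(t, p_i) for each p_i as a sum of t Bernoulli trials."""
    return [sum(rng.random() < p for _ in range(t)) for p in ps]


def emd_equal_size(a, b):
    """Earth mover distance between two equal-size 1-D samples.

    For equal-size samples this is the mean absolute difference of the
    sorted values, which equals the l1 distance between the empirical CDFs.
    """
    if len(a) != len(b):
        raise ValueError("samples must have the same length")
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def empirical_emd(ps, t, seed=0):
    """EMD between the true p_i's and the empirical estimates X_i / t."""
    rng = random.Random(seed)
    counts = simulate_counts(ps, t, rng)
    estimates = [x / t for x in counts]
    return emd_equal_size(ps, estimates)


if __name__ == "__main__":
    # Hypothetical population: every entity has p_i = 0.5 (an assumption
    # chosen for illustration, not a dataset from the paper).
    ps = [0.5] * 500
    for t in (10, 40, 160):
        print(t, empirical_emd(ps, t))
```

Under these assumptions, the reported error of the empirical estimates should shrink roughly by half each time $t$ is quadrupled, in line with the $\Theta(\frac{1}{\sqrt{t}})$ rate stated above.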
