Learning Mixtures of Arbitrary Distributions over Large Discrete Domains
We give an algorithm for learning a mixture of unstructured distributions. This problem arises in various unsupervised learning scenarios, for example in learning topic models from a corpus of documents spanning several topics. We show how to learn the constituents (the topic distributions and the mixture weights) of a mixture of k (constant) arbitrary distributions over a large discrete domain [n] = {1, 2, ..., n}, using polynomially many samples. This task is information-theoretically impossible for k > 1 under the usual sampling process from a mixture distribution. However, there are situations (such as the above-mentioned topic-model case) in which each sample point consists of several observations from the same mixture constituent. This number of observations, which we call the "sampling aperture", is a crucial parameter of the problem. We show that efficient learning is possible exactly at the information-theoretically least-possible aperture of 2k − 1. (Independent work by others places certain restrictions on the model, which enable learning with a smaller aperture, albeit using, in general, a significantly larger sample size.) A sequence of tools contributes to the algorithm, including concentration results for random matrices, dimension reduction, moment estimation, and sensitivity analysis.
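The role of the aperture can be illustrated on a toy instance. The following sketch (not the paper's algorithm; the instance, parameter values, and variable names are illustrative assumptions) takes k = 2 Bernoulli constituents, draws 2k − 1 = 3 observations per sample point, and recovers the two biases and the mixture weight from the first three moments via Prony's method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy instance: a mixture of k = 2 Bernoulli constituents
# (domain size 2), so aperture 2k - 1 = 3 suffices.
w, p, q = 0.6, 0.8, 0.3        # mixture weight and the two constituent biases
N = 200_000                    # number of sample points ("documents")

# Each sample point: pick a constituent, then draw 3 observations from it.
comp = rng.random(N) < w
bias = np.where(comp, p, q)
flips = rng.random((N, 3)) < bias[:, None]   # shape (N, 3): aperture-3 samples

# Estimate the moments s_j = w p^j + (1-w) q^j from products of distinct
# observations within the same sample point (s_0 = 1 is known).
x1, x2, x3 = flips.T.astype(float)
s1 = np.mean((x1 + x2 + x3) / 3)
s2 = np.mean((x1 * x2 + x1 * x3 + x2 * x3) / 3)
s3 = np.mean(x1 * x2 * x3)

# The moments satisfy the recurrence s_{j+2} = a s_{j+1} + b s_j, where
# p and q are the roots of x^2 - a x - b = 0 (Prony's method).
A = np.array([[s1, 1.0], [s2, s1]])
a, b = np.linalg.solve(A, np.array([s2, s3]))
disc = np.sqrt(a * a + 4 * b)
p_hat, q_hat = (a + disc) / 2, (a - disc) / 2
w_hat = (s1 - q_hat) / (p_hat - q_hat)

print(p_hat, q_hat, w_hat)   # close to 0.8, 0.3, 0.6
```

With only two moments (aperture below 2k − 1) the system above is underdetermined, which matches the information-theoretic threshold stated in the abstract; the general algorithm must additionally handle a large domain [n], which is where the dimension-reduction and random-matrix tools come in.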