Recovering Structured Probability Matrices

We consider the problem of accurately recovering an M × M matrix B, representing a probability distribution over M² outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information-theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions currently being considered by different communities, including community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and efforts in the natural language processing community to compute "word embeddings". Our results apply to the setting where B has a rank-2 structure. For this setting, we propose an efficient (and practically viable) algorithm that accurately recovers the underlying M × M matrix using Θ(M) samples. This result easily translates to Θ(M)-sample algorithms for learning topic models with two topics over dictionaries of size M, and for learning hidden Markov models with two hidden states and observation distributions supported on M elements. These linear sample complexities are optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Ω(M) samples. This impossibility of sublinear-sample property testing in these settings is intriguing, and it underscores the significant differences between these structured settings and the standard setting of drawing i.i.d. samples from an unstructured distribution of support size M.
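To make the sampling model concrete, the following minimal sketch (in Python/NumPy, with hypothetical parameter choices such as M, N, and the mixing weights) illustrates the setting described above: a rank-2 probability matrix built as a two-topic mixture, and an observed count matrix obtained from on the order of Θ(M) independent samples. It depicts only the data-generating process, not the recovery algorithm of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

M = 1000          # support size (hypothetical choice for illustration)
N = 5 * M         # number of samples, on the order of Theta(M)

# Two topic distributions over M words, mixed with (assumed) weights w1, w2.
p1 = rng.dirichlet(np.ones(M))
p2 = rng.dirichlet(np.ones(M))
w1, w2 = 0.6, 0.4

# Rank-2 probability matrix B over the M^2 outcomes (pairs of words).
B = w1 * np.outer(p1, p1) + w2 * np.outer(p2, p2)

# Observed counts: N independent draws from the distribution over the M^2 cells.
pvals = B.ravel()
pvals = pvals / pvals.sum()          # guard against floating-point drift
counts = rng.multinomial(N, pvals).reshape(M, M)

# In the sparse regime N = Theta(M), the vast majority of the M^2 cells are
# empty, which is why naive entrywise estimation of B from counts / N fails.
print("nonzero cells:", np.count_nonzero(counts), "of", M * M)
```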