Statistical-Computational Tradeoffs in Planted Problems and Submatrix Localization with a Growing Number of Clusters and Submatrices
We consider two closely related problems: planted clustering and submatrix localization. The planted clustering model assumes that a graph is generated from some unknown clusters by randomly placing edges between nodes according to their cluster memberships; the task is to recover the clusters given the graph. Special cases include the classical planted clique, planted densest subgraph, planted partition and planted coloring problems. In the submatrix localization problem, also known as bi-clustering, the goal is to locate hidden submatrices with elevated means inside a large random matrix. Of particular interest is the setting where the number of clusters/submatrices is allowed to grow unbounded with the problem size. We consider both the statistical and computational aspects of these two problems, and prove the following. The space of the model parameters can be partitioned into four disjoint regions corresponding to decreasing statistical and computational complexities: (1) the "impossible" regime, where all algorithms fail; (2) the "hard" regime, where the exponential-time Maximum Likelihood Estimator (MLE) succeeds; (3) the "easy" regime, where the polynomial-time convexified MLE succeeds; (4) the "simple" regime, where a simple counting/thresholding procedure succeeds. Moreover, we show that each of these algorithms provably fails in the previous harder regimes. Our theorems establish the first minimax recovery results for the two problems with unbounded numbers of clusters/submatrices, and provide the best known guarantees achievable by polynomial-time algorithms. These results demonstrate the tradeoffs between statistical and computational considerations.
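To make the planted model and the "simple" regime concrete, the following is a minimal sketch, not the procedure analyzed in the paper. It assumes the single-cluster special case (planted densest subgraph) and uses plain degree counting; the function names and the parameters `n`, `k`, `p`, `q` (graph size, cluster size, in-cluster and background edge probabilities) are illustrative choices, and the values are picked so that the degree gap is large.

```python
import random


def planted_dense_subgraph(n, k, p, q, rng):
    """Sample a symmetric 0/1 adjacency matrix on n nodes.

    Assumption for illustration: nodes 0..k-1 form the single planted
    cluster. Pairs inside the cluster are connected with probability p,
    all other pairs with probability q (p > q).
    """
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            prob = p if (i < k and j < k) else q
            if rng.random() < prob:
                adj[i][j] = adj[j][i] = 1
    return adj


def recover_by_degree(adj, k):
    """Counting procedure: return the k nodes with the largest degrees."""
    degrees = [sum(row) for row in adj]
    ranked = sorted(range(len(adj)), key=lambda v: degrees[v], reverse=True)
    return set(ranked[:k])


# Hypothetical parameters with a strong signal: cluster nodes have expected
# degree about 67, background nodes about 20.
rng = random.Random(0)
adj = planted_dense_subgraph(n=200, k=60, p=0.9, q=0.1, rng=rng)
recovered = recover_by_degree(adj, k=60)
print(len(recovered & set(range(60))), "of 60 cluster nodes recovered")
```

Degree counting only works when the gap between `p` and `q` is large relative to the degree fluctuations; as that gap shrinks, the abstract's harder regimes call for the convexified or exact MLE instead.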