Do Semidefinite Relaxations Really Solve Sparse PCA?

Abstract

Estimating the leading principal components of data, assuming they are sparse, is a central task in modern high-dimensional statistics. Many algorithms have been proposed for this sparse PCA problem, from simple diagonal thresholding to sophisticated semidefinite programming (SDP) methods. A key theoretical question is under what conditions such algorithms can recover the sparse principal components. We study this question for a single-spike model with an $\ell_0$-sparse spike, in the regime where the dimension $p$ and sample size $n$ tend to infinity. Amini and Wainwright (2009) proved that for sparsity levels $k \geq \Omega(n/\log p)$, no algorithm, efficient or not, can reliably recover the sparse eigenvector. In contrast, for sparsity levels $k \leq O(\sqrt{n/\log p})$, diagonal thresholding is asymptotically consistent. It was further conjectured that the SDP approach may close this gap between computational and information limits. We prove that when $k \geq \Omega(\sqrt{n})$, the SDP approach, at least in its standard usage, cannot recover the sparse spike. In fact, we conjecture that in the single-spike model, no computationally efficient algorithm can recover a spike of $\ell_0$-sparsity $k \geq \Omega(\sqrt{n})$. Finally, we present empirical results suggesting that up to sparsity levels $k = O(\sqrt{n})$, recovery is possible by a simple covariance thresholding algorithm.
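For intuition, here is a minimal sketch of a covariance-thresholding estimator of the kind the abstract alludes to, under the single-spike model $\Sigma = I_p + \beta vv^\top$ with a $k$-sparse unit vector $v$. This is not the paper's exact algorithm: the hard-threshold rule, the tuning constant `tau`, and the final support-restriction step are illustrative assumptions.

```python
import numpy as np

def covariance_thresholding(X, k, tau=2.0):
    """Sketch of a covariance-thresholding estimator of a k-sparse spike.

    X   : (n, p) data matrix, rows are i.i.d. samples
    k   : assumed l0-sparsity of the spike
    tau : hypothetical threshold scale; off-diagonal noise entries of the
          sample covariance are of order 1/sqrt(n)
    """
    n, p = X.shape
    S = X.T @ X / n                                   # sample covariance
    # Hard-threshold entries at level tau/sqrt(n), keeping the diagonal.
    T = np.where(np.abs(S) >= tau / np.sqrt(n), S, 0.0)
    np.fill_diagonal(T, np.diag(S))
    # Leading eigenvector of the thresholded covariance (eigh sorts ascending).
    _, vecs = np.linalg.eigh(T)
    v = vecs[:, -1]
    # Restrict to the k largest-magnitude coordinates and renormalize.
    support = np.argsort(np.abs(v))[-k:]
    v_hat = np.zeros(p)
    v_hat[support] = v[support]
    return v_hat / np.linalg.norm(v_hat)
```

A quick synthetic check in the regime $k \approx \sqrt{n}$, where the abstract's experiments suggest recovery is possible (parameter values here are arbitrary illustrations):

```python
rng = np.random.default_rng(0)
n, p, k, beta = 500, 1000, 20, 3.0
v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)
# x_i = sqrt(beta) * u_i * v + z_i has covariance I_p + beta * v v^T.
X = rng.standard_normal((n, p)) + np.sqrt(beta) * np.outer(rng.standard_normal(n), v)
v_hat = covariance_thresholding(X, k)
print(abs(v_hat @ v))  # a value near 1 indicates successful recovery
```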
