The $k$-principal component analysis ($k$-PCA) problem is a fundamental algorithmic primitive that is widely used in data analysis and dimensionality reduction applications. In statistical settings, the goal of $k$-PCA is to identify a top eigenspace of the covariance matrix of a distribution, to which we only have black-box access via samples. Motivated by these settings, we analyze black-box deflation methods as a framework for designing $k$-PCA algorithms, where we model access to the unknown target matrix via a black-box $1$-PCA oracle which returns an approximate top eigenvector, under two popular notions of approximation. Despite being arguably the most natural reduction-based approach to $k$-PCA algorithm design, such black-box methods, which recursively call a $1$-PCA oracle $k$ times, were previously poorly understood. Our main contribution is significantly sharper bounds on the approximation parameter degradation of deflation methods for $k$-PCA. For a quadratic form notion of approximation we term ePCA (energy PCA), we show deflation methods suffer no parameter loss. For an alternative well-studied approximation notion we term cPCA (correlation PCA), we tightly characterize the parameter regimes where deflation methods are feasible. Moreover, we show that in all feasible regimes, $k$-cPCA deflation algorithms suffer no asymptotic parameter loss for any constant $k$. We apply our framework to obtain state-of-the-art $k$-PCA algorithms robust to dataset contamination, improving prior work in sample complexity by a $\mathrm{poly}(k)$ factor.
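As a rough illustration of the deflation framework described above, the following minimal sketch calls a 1-PCA oracle k times, projecting out each returned direction before the next call. It is not taken from the paper: the names `power_method_oracle` and `deflation_k_pca` are hypothetical, the oracle is simulated by plain power iteration on an explicitly known matrix (whereas the abstract's setting only assumes black-box access), and the projection-based deflation step is one standard choice that may differ in detail from the formulation the authors analyze.

```python
import numpy as np


def power_method_oracle(M, iters=500, seed=0):
    """Hypothetical 1-PCA oracle: approximate top eigenvector of a PSD matrix M.

    Assumption: simulated here with power iteration; any approximate
    top-eigenvector routine could be substituted.
    """
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(M.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = M @ v
        norm = np.linalg.norm(w)
        if norm == 0.0:  # M annihilates v; nothing more to extract
            break
        v = w / norm
    return v


def deflation_k_pca(M, k, oracle=power_method_oracle):
    """Sketch of black-box deflation: k oracle calls on successively deflated matrices."""
    d = M.shape[0]
    P = np.eye(d)  # projector onto the complement of the directions found so far
    directions = []
    for _ in range(k):
        u = oracle(P @ M @ P)  # oracle only ever sees the deflated matrix
        u = P @ u              # re-project to keep the directions orthogonal
        u /= np.linalg.norm(u)
        directions.append(u)
        P = P - np.outer(u, u)  # remove the new direction before the next call
    return np.column_stack(directions)


# Usage on a toy matrix with known spectrum (eigenvalues 5, 3, 1, 0.5).
M = np.diag([5.0, 3.0, 1.0, 0.5])
U = deflation_k_pca(M, k=2)
captured = np.trace(U.T @ M @ U)  # quadratic-form "energy" captured by U
```

In this toy run the columns of `U` are orthonormal and `captured` can be compared against the sum of the top two eigenvalues of `M`, which is the kind of quadratic-form guarantee the energy notion of approximation is concerned with.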