Asymptotic distribution of principal component scores for pervasive, high-dimensional eigenvectors

Principal component analysis (PCA) is a widely used technique for dimension reduction, including for high-dimensional data. In the high-dimensional framework, PCA is not asymptotically consistent, as the sample eigenvectors do not converge to the population eigenvectors. However, in this paper it is shown that for a pervasive signal, the sample principal component (PC) scores convey the same visual content as the population PC scores. The asymptotic distribution of the ratio between the individual sample and population scores is derived, assuming that the eigenvalues scale linearly with the dimension. The distribution of the ratio consists of a main shift and a noise part, where the main shift does not depend on the individual scores. As a consequence, all sample scores are affected by an approximately common scaling, so that the relative positions of the population scores are preserved. Simulations show that the noise part is negligible for the purpose of visualization, for small to moderate sample sizes depending on the signal strength. The realism of the eigenvalue assumption is supported by introducing the pervasive signal structure, where the number of non-zero effects is a non-vanishing proportion of the total number of variables. If an eigenvector is pervasive and fixed, we show that the corresponding eigenvalue scales linearly with the dimension. Two data examples from genomics, where pervasiveness is a reasonable assumption, are discussed.
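The following is a minimal simulation sketch of the setting described above, not the authors' code: a single pervasive factor with a fixed proportion of non-zero loadings, so that the leading eigenvalue grows linearly with the dimension. Under these assumptions, the sample PC scores should be close to a common rescaling of the population scores. All parameter choices (n, p, the 40% non-zero proportion, unit noise) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5000               # small sample size, high dimension
prop_nonzero = 0.4            # pervasive: fixed proportion of non-zero loadings

# Pervasive loading vector: ||v||^2 grows linearly with p.
v = np.zeros(p)
v[rng.choice(p, size=int(prop_nonzero * p), replace=False)] = 1.0

u = rng.standard_normal(n)                          # latent factor per sample
X = np.outer(u, v) + rng.standard_normal((n, p))    # signal plus unit noise
Xc = X - X.mean(axis=0)

pop_scores = Xc @ (v / np.linalg.norm(v))   # projection on population eigenvector
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
sample_scores = Xc @ Vt[0]                  # projection on leading sample eigenvector

# Align the arbitrary sign of the sample eigenvector, then inspect the ratios:
# a nearly constant ratio corresponds to the common scaling described above.
if np.dot(sample_scores, pop_scores) < 0:
    sample_scores = -sample_scores
ratios = sample_scores / pop_scores
print("median ratio:", np.median(ratios))
print("IQR of ratios:", np.subtract(*np.percentile(ratios, [75, 25])))
print("corr(sample, population):", np.corrcoef(sample_scores, pop_scores)[0, 1])
```

In this toy setup the ratios cluster tightly around a single value and the correlation between sample and population scores is close to one, consistent with the visualization claim; a few observations with population scores near zero can produce outlying ratios, which is why robust summaries are printed.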