$k$-PCA for (non-squared) Euclidean Distances: Polynomial Time Approximation

Main: 20 Pages
4 Figures
Bibliography: 5 Pages
Abstract

Given an integer $k\geq 1$ and a set $P$ of $n$ points in $\mathbb{R}^d$, the classic $k$-PCA (Principal Component Analysis) approximates the affine \emph{$k$-subspace mean} of $P$, which is the $k$-dimensional affine linear subspace that minimizes its sum of squared Euclidean distances ($\ell_{2,2}$-norm) over the points of $P$, i.e., the mean of these distances. The \emph{$k$-subspace median} is the subspace that minimizes its sum of (non-squared) Euclidean distances ($\ell_{2,1}$-mixed norm), i.e., their median. The median subspace is usually sparser and more robust to noise/outliers than the mean, but also much harder to approximate since, unlike the $\ell_{z,z}$ (non-mixed) norms, it is non-convex for $k<d-1$. We provide the first polynomial-time deterministic algorithm whose running time and approximation factor are both not exponential in $k$. More precisely, the multiplicative approximation factor is $\sqrt{d}$, and the running time is polynomial in the size of the input. We expect that our technique would be useful for many other related problems, such as the $\ell_{2,z}$ norm of distances for $z\notin\{1,2\}$, e.g., $z=\infty$, and handling outliers/sparsity. Open code and experimental results on real-world datasets are also provided.
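To make the two objectives concrete, the following minimal sketch evaluates both costs for a fixed candidate subspace. It is not the paper's algorithm and does not search for an optimal subspace. The function names, the toy points, and the assumption that the subspace is given as a center plus an orthonormal basis are all illustrative choices, not taken from the paper.

```python
import math

def dist_to_subspace(p, center, basis):
    """Euclidean distance from p to the affine subspace center + span(basis).

    Assumption: `basis` is a list of orthonormal vectors (hypothetical input format).
    """
    diff = [pi - ci for pi, ci in zip(p, center)]
    residual = diff[:]
    for v in basis:
        # Remove the component of diff that lies along the basis vector v.
        coeff = sum(di * vi for di, vi in zip(diff, v))
        residual = [ri - coeff * vi for ri, vi in zip(residual, v)]
    return math.sqrt(sum(r * r for r in residual))

def subspace_costs(points, center, basis):
    """Return (sum of squared distances, sum of non-squared distances)."""
    dists = [dist_to_subspace(p, center, basis) for p in points]
    mean_cost = sum(d * d for d in dists)   # l_{2,2} objective (k-subspace mean)
    median_cost = sum(dists)                # l_{2,1} objective (k-subspace median)
    return mean_cost, median_cost

# Toy data in the plane: three points near the x-axis and one outlier.
points = [(0.0, 1.0), (1.0, -1.0), (2.0, 0.0), (3.0, 10.0)]
center = (0.0, 0.0)
basis = [(1.0, 0.0)]  # candidate 1-dimensional subspace: the x-axis

mean_cost, median_cost = subspace_costs(points, center, basis)
print(mean_cost, median_cost)
```

In this toy example the distances to the x-axis are 1, 1, 0, and 10, so the outlier contributes 100 of 102 to the squared cost but only 10 of 12 to the non-squared cost, which illustrates why the median objective is less sensitive to outliers.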
