Can We Reconcile Robustness and Efficiency in Unsupervised Learning?

Abstract

We consider a fundamental problem in unsupervised learning: given a collection of $m$ points in $\mathbb{R}^n$, if many but not necessarily all of these points are contained in a $d$-dimensional subspace $T$, can we find it? The points contained in $T$ are called {\em inliers} and the remaining points are {\em outliers}. This problem has received considerable attention in computer science and in statistics. Yet efficient algorithms from computer science are not robust to {\em adversarial} outliers, and the estimators from robust statistics are hard to compute in high dimensions. This is a serious and persistent tension not just in this application, but for many other problems in unsupervised learning. Are there algorithms for subspace recovery that are both robust to outliers and efficient? We give an algorithm that finds $T$ when it contains more than a $\frac{d}{n}$ fraction of the points. Hence, for say $d = n/2$, this estimator is both easy to compute and well-behaved when a constant fraction of the points are outliers. We prove that it is Small Set Expansion hard to find $T$ when the fraction of errors is any larger, and so our estimator is an {\em optimal} compromise between efficiency and robustness. In fact, this basic problem has a surprising number of connections to other areas, including small set expansion, matroid theory and functional analysis, that we make use of here.
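To make the $\frac{d}{n}$ threshold concrete, here is a minimal sketch of the sampling idea such a guarantee suggests, written in Python/NumPy. It is not the paper's exact procedure: the function name, parameters, and the noiseless, general-position setting are our assumptions for illustration. The idea: if inliers make up more than a $\frac{d}{n}$ fraction of the points, a random sample of $n$ points contains at least $d+1$ inliers with noticeable probability, making the sample linearly dependent, and in general position any linear dependence can only involve inliers.

```python
import numpy as np

def recover_subspace(X, d, trials=200, tol=1e-8, seed=0):
    """Hypothetical sketch of randomized robust subspace recovery.

    X : (m, n) array, each row a point in R^n
    d : dimension of the hidden subspace T
    Assumes noiseless inliers, points in general position, and that
    inliers make up more than a d/n fraction of the rows of X.
    Returns a (d, n) orthonormal basis of the estimated subspace.
    """
    m, n = X.shape
    rng = np.random.default_rng(seed)
    inliers = set()
    for _ in range(trials):
        idx = rng.choice(m, size=n, replace=False)
        S = X[idx].T                      # n x n: columns are the sampled points
        U, s, Vt = np.linalg.svd(S)
        # If the sample holds >= d+1 inliers it is singular, and (in general
        # position) every kernel vector is supported on inliers only.
        for j in np.nonzero(s < tol * s[0])[0]:
            v = Vt[j]                     # kernel vector: S @ v ~ 0
            inliers.update(idx[np.abs(v) > tol].tolist())
        if len(inliers) > d:              # enough inliers to span T
            break
    pts = X[sorted(inliers)]
    # Top-d right singular vectors of the collected inliers span the estimate of T.
    return np.linalg.svd(pts)[2][:d]
```

In this idealized model, every point touched by a kernel vector is a certified inlier, which is what makes the $\frac{d}{n}$ threshold plausible; handling noise and proving the matching Small Set Expansion hardness are where the paper's actual work lies.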
