
Improved Outlier Robust Seeding for k-means

Abstract

The $k$-means objective is popular for clustering, although it is inherently non-robust and sensitive to outliers. Its popular seeding or initialization, called $k$-means++, uses $D^{2}$ sampling and comes with a provable $O(\log k)$ approximation guarantee \cite{AV2007}. However, in the presence of adversarial noise or outliers, $D^{2}$ sampling is more likely to pick centers from distant outliers instead of inlier clusters, and therefore its approximation guarantee \textit{w.r.t.} the $k$-means solution on the inliers does not hold. Assuming that the outliers constitute a constant fraction of the given data, we propose a simple variant of the $D^{2}$ sampling distribution, which makes it robust to outliers. Our algorithm runs in $O(ndk)$ time, outputs $O(k)$ clusters, discards marginally more points than the optimal number of outliers, and comes with a provable $O(1)$ approximation guarantee. Our algorithm can also be modified to output exactly $k$ clusters instead of $O(k)$ clusters, while keeping its running time linear in $n$ and $d$. This is an improvement over previous results for robust $k$-means based on LP relaxation and rounding \cite{Charikar}, \cite{KrishnaswamyLS18} and over \textit{robust $k$-means++} \cite{DeshpandeKP20}. Our empirical results show the advantage of our algorithm over $k$-means++~\cite{AV2007}, uniform random seeding, greedy sampling for $k$-means~\cite{tkmeanspp}, and robust $k$-means++~\cite{DeshpandeKP20}, on standard real-world and synthetic data sets used in previous work. Our proposal is easily amenable to scalable, faster, parallel implementations of $k$-means++ \cite{Bahmani,BachemL017} and is of independent interest for coreset constructions in the presence of outliers \cite{feldman2007ptas,langberg2010universal,feldman2011unified}.
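To make the failure mode concrete, the following minimal sketch implements standard $D^{2}$ sampling as used by $k$-means++ \cite{AV2007}: each point is drawn with probability proportional to its squared distance to the nearest center chosen so far. It is not the robust variant proposed in this paper. The helper names (`sq_dist`, `d2_probabilities`, `kmeanspp_seed`) and the toy data are assumptions introduced purely for illustration.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d2_probabilities(points, centers):
    """D^2 sampling distribution: P(x) is proportional to min_c ||x - c||^2."""
    d2 = [min(sq_dist(p, c) for c in centers) for p in points]
    total = sum(d2)
    if total == 0:
        # Every point coincides with a center; fall back to uniform.
        return [1.0 / len(points)] * len(points)
    return [d / total for d in d2]


def kmeanspp_seed(points, k, rng=None):
    """Standard k-means++ seeding (illustrative, not the paper's algorithm)."""
    rng = rng or random.Random(0)
    centers = [rng.choice(points)]  # first center is uniform at random
    while len(centers) < k:
        probs = d2_probabilities(points, centers)
        centers.append(rng.choices(points, weights=probs, k=1)[0])
    return centers


# Hypothetical toy data: two tight inlier clusters and one distant outlier.
inliers = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
outlier = (100.0, 100.0)
points = inliers + [outlier]

# With one center at the origin, the single outlier receives more than
# 0.99 of the sampling mass, so it is almost surely the next center.
probs = d2_probabilities(points, centers=[(0.0, 0.0)])
outlier_mass = probs[-1]
print(f"probability of sampling the outlier next: {outlier_mass:.4f}")
```

In this toy instance one far-away point absorbs almost all of the $D^{2}$ mass, which is the behavior the abstract describes. Any robust variant has to limit how much probability such distant points can receive.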
