Relational Algorithms for k-means Clustering

Abstract

The majority of learning tasks faced by data scientists involve relational data, yet most standard algorithms for standard learning problems are not designed to accept relational data as input. The standard practice to address this issue is to join the relational data to create the type of geometric input that standard learning algorithms expect. Unfortunately, this standard practice has exponential worst-case time and space complexity. This leads us to consider what we call the Relational Learning Question: ``Which standard learning algorithms can be efficiently implemented on relational data, and for those that cannot, is there an alternative algorithm that can be efficiently implemented on relational data and that has similar performance guarantees to the standard algorithm?'' In this paper, we address the Relational Learning Question for two well-known algorithms for the standard $k$-means clustering problem. We first show that the $k$-means++ algorithm can be efficiently implemented on relational data. In contrast, we show that the adaptive $k$-means algorithm likely cannot be efficiently implemented on relational data, as this would imply $P = \#P$. However, we show that a slight variation of this adaptive $k$-means algorithm can be efficiently implemented on relational data, and that this alternative algorithm has the same performance guarantee as the original algorithm, namely that it outputs an $O(1)$-approximate sketch.
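To make the "standard practice" concrete, the following minimal sketch materializes a join of two tiny tables and then runs the textbook $k$-means++ seeding ($D^2$ sampling) on the joined points. It is an illustration under stated assumptions, not the relational implementation described in the paper: the tables `R` and `S`, the helper names `natural_join`, `sq_dist`, and `kmeanspp_seed`, and the two-column schema are all hypothetical.

```python
import random

def natural_join(r, s):
    """Join rows (key, x) of r with rows (key, y) of s on key.

    Each joined row becomes a 2-dimensional point (x, y). With more tables,
    the size of such a join can grow exponentially in the number of tables.
    """
    return [(x, y) for (kr, x) in r for (ks, y) in s if kr == ks]

def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeanspp_seed(points, k, rng=None):
    """Textbook k-means++ seeding on explicitly materialized points."""
    rng = rng or random.Random(0)
    # First center: chosen uniformly at random.
    centers = [rng.choice(points)]
    # d2[i] is the squared distance from points[i] to its nearest center.
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; sample uniformly.
            nxt = rng.choice(points)
        else:
            # Sample the next center with probability proportional to d2.
            nxt = rng.choices(points, weights=d2, k=1)[0]
        centers.append(nxt)
        d2 = [min(d, sq_dist(p, nxt)) for d, p in zip(d2, points)]
    return centers

# Hypothetical toy tables, illustrative only.
R = [(1, 0.0), (1, 1.0), (2, 10.0)]
S = [(1, 0.5), (2, 9.0), (2, 11.0)]
points = natural_join(R, S)
centers = kmeanspp_seed(points, 3, random.Random(7))
```

The sketch touches every joined row on each round, which is exactly the cost that an algorithm operating directly on the relational input would aim to avoid.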
