
Efficient Distributed Algorithms for the $K$-Nearest Neighbors Problem

Abstract

The $K$-nearest neighbors is a basic problem in machine learning with numerous applications. In this problem, given a (training) set of $n$ data points with labels and a query point $p$, we want to assign a label to $p$ based on the labels of the $K$ points nearest to the query. We study this problem in the {\em $k$-machine model}, a model for distributed large-scale data. (Note that the parameter $k$ stands for the number of machines in the $k$-machine model and is independent of the $K$ nearest points.) In this model, we assume that the $n$ points are distributed (in a balanced fashion) among the $k$ machines, and the goal is to quickly compute the answer once a query point is given to one of the machines. Our main result is a simple randomized algorithm in the $k$-machine model that runs in $O(\log K)$ communication rounds and succeeds with high probability (regardless of the number of machines $k$ and the number of points $n$). The message complexity of the algorithm is small, taking only $O(k \log K)$ messages. Our bounds are essentially the best possible for comparison-based algorithms (algorithms that use only comparison operations ($\leq$, $\geq$, $=$) between elements to distinguish their ordering). This follows from Rodeh's lower bound of $\Omega(\log n)$ communication rounds for finding the {\em median} of $2n$ elements distributed evenly between two processors \cite{rodeh}. We also implemented our algorithm and show that it performs well compared to an algorithm (used in practice) that sends the $K$ nearest points from each machine to a single machine, which then computes the answer.
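To make the baseline concrete, here is a minimal single-process sketch of the practical algorithm mentioned above: each machine sends its $K$ locally nearest points to one coordinator machine, which merges the $kK$ candidates and takes a majority vote over the global $K$ nearest. This is an illustrative simulation, not the paper's $O(\log K)$-round algorithm; the function names and the distance function are our own, and the input representation (a list of per-machine shards of `(point, label)` pairs) is an assumption.

```python
import heapq

def knn_baseline(machines, query, K, dist):
    """Baseline K-NN in the k-machine model (simulated sequentially).

    machines : list of shards; each shard is a list of (point, label) pairs.
    query    : the query point.
    K        : number of nearest neighbors to use for the vote.
    dist     : distance function dist(point, query) -> float.
    """
    candidates = []
    for shard in machines:
        # Each machine locally selects its K nearest points to the query
        # and "sends" them to the coordinator (K points per machine).
        local = heapq.nsmallest(K, shard, key=lambda pl: dist(pl[0], query))
        candidates.extend(local)
    # The coordinator picks the global K nearest among the k*K candidates.
    top = heapq.nsmallest(K, candidates, key=lambda pl: dist(pl[0], query))
    # Majority label among the K nearest decides the prediction.
    labels = [label for _, label in top]
    return max(set(labels), key=labels.count)

# Toy usage: two machines, 1-D points, K = 3.
shards = [[((0.0,), "a"), ((5.0,), "b")],
          [((1.0,), "a"), ((9.0,), "b")]]
print(knn_baseline(shards, (0.5,), 3, lambda p, q: abs(p[0] - q[0])))  # a
```

Note that this baseline communicates $K$ points per machine ($kK$ in total), whereas the paper's algorithm uses only $O(k \log K)$ messages.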
