k-Nearest Neighbour Classification of Datasets with a Family of Distances

Abstract

The $k$-nearest neighbour ($k$-NN) classifier is one of the oldest and most important supervised learning algorithms for classifying datasets. Traditionally, the Euclidean norm is used as the distance for the $k$-NN classifier. In this thesis we investigate the use of alternative distances for the $k$-NN classifier. We start by introducing some background notions in statistical machine learning. We define the $k$-NN classifier and discuss Stone's theorem and the proof that $k$-NN is universally consistent on the normed space $\mathbb{R}^d$. We then prove that $k$-NN is universally consistent if we take a sequence of random norms (that are independent of the sample and the query) from a family of norms that satisfies a particular boundedness condition. We extend this result by replacing norms with distances based on uniformly locally Lipschitz functions that satisfy certain conditions. We discuss the limitations of Stone's lemma and Stone's theorem, particularly with respect to quasinorms and to adaptively choosing a distance for $k$-NN based on the labelled sample. We show the universal consistency of a two-stage $k$-NN type classifier in which the distance is selected adaptively based on a split labelled sample and the query. We conclude by giving some examples of improvements in the accuracy of classifying various datasets using the above techniques.
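
To make the idea of swapping the distance concrete, here is a minimal sketch of a $k$-NN classifier whose distance comes from a family of norms. The abstract does not say which family the thesis uses; the $\ell_p$ norms below are an assumption chosen for illustration, and the names `lp_distance` and `knn_classify` are hypothetical.

```python
from collections import Counter


def lp_distance(x, y, p=2.0):
    """Distance induced by the l_p norm (p >= 1); p = 2 is the Euclidean case.

    The l_p family is an illustrative assumption, not the thesis's own choice.
    """
    if p == float("inf"):
        return max(abs(a - b) for a, b in zip(x, y))
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def knn_classify(sample, query, k, p=2.0):
    """Majority vote among the k labelled points nearest to `query` under l_p.

    `sample` is a list of (point, label) pairs. Vote ties go to the label
    that appears first among the neighbours sorted by distance.
    """
    neighbours = sorted(sample, key=lambda pair: lp_distance(pair[0], query, p))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Two labelled points whose ranking around the origin depends on the norm.
sample = [((3.0, 0.0), "a"), ((2.0, 2.0), "b")]
label_l1 = knn_classify(sample, (0.0, 0.0), k=1, p=1.0)
label_linf = knn_classify(sample, (0.0, 0.0), k=1, p=float("inf"))
```

With $p = 1$ the point $(3, 0)$ is nearer to the origin (distance 3 versus 4), while under the sup norm $(2, 2)$ is nearer (distance 2 versus 3), so the predicted label changes with the norm alone.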
