Generating Synthetic Data with The Nearest Neighbors Algorithm

3 October 2022

Abstract

The $k$ nearest neighbor algorithm ( $k$ NN) is one of the most popular nonparametric methods used for various purposes, such as treatment effect estimation, missing value imputation, classification, and clustering. The main advantage of $k$ NN is its simplicity of hyperparameter optimization. It often produces favorable results with minimal effort. This paper proposes a generic semiparametric (or nonparametric if required) approach named Local Resampler (LR). LR utilizes $k$ NN to create subsamples from the original sample and then generates synthetic values that are drawn from locally estimated distributions. LR can accurately create synthetic samples, even if the original sample has a non-convex distribution. Moreover, LR shows better or similar performance to other popular synthetic data methods with minimal model optimization with parametric distributional assumptions.

View on arXiv

Comments on this paper