The $k$ nearest neighbor algorithm ($k$NN) is one of the most popular nonparametric methods, used for purposes such as treatment effect estimation, missing value imputation, classification, and clustering. The main advantage of $k$NN is the simplicity of its hyperparameter optimization; it often produces favorable results with minimal effort. This paper proposes a generic semiparametric (or nonparametric, if required) approach named the Local Resampler (LR). LR utilizes $k$NN to create subsamples from the original sample and then generates synthetic values drawn from locally estimated distributions. LR can accurately create synthetic samples even when the original sample follows a non-convex distribution. Moreover, despite relying on parametric distributional assumptions, LR matches or outperforms other popular synthetic data methods while requiring minimal model optimization.
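To make the neighborhood-based resampling idea concrete, the following is a minimal sketch, not the authors' exact specification: the neighborhood size $k$, the choice of a multivariate normal as the local distribution, and the helper name local_resample are illustrative assumptions.

```python
# Sketch of kNN-based local resampling: fit a local distribution to each
# neighborhood and draw synthetic points from it (illustrative assumptions).
import numpy as np
from sklearn.neighbors import NearestNeighbors


def local_resample(X, n_synthetic, k=10, random_state=0):
    """Generate synthetic rows by sampling from distributions fitted to kNN neighborhoods."""
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k).fit(X)

    # Pick anchor points at random from the original sample.
    anchors = rng.integers(0, len(X), size=n_synthetic)
    _, idx = nn.kneighbors(X[anchors])

    synthetic = np.empty((n_synthetic, X.shape[1]))
    for i, neighbors in enumerate(idx):
        local = X[neighbors]
        mean = local.mean(axis=0)
        # A small ridge keeps the covariance positive definite for small k.
        cov = np.cov(local, rowvar=False) + 1e-9 * np.eye(X.shape[1])
        synthetic[i] = rng.multivariate_normal(mean, cov)
    return synthetic


if __name__ == "__main__":
    # Two-moons data is a simple example of a non-convex distribution.
    from sklearn.datasets import make_moons

    X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
    X_syn = local_resample(X, n_synthetic=500, k=15)
    print(X_syn.shape)  # (500, 2)
```

Because each synthetic point is drawn from a distribution fitted only to a small neighborhood, the generated sample stays close to the local geometry of the data rather than to any single global shape, which is why a non-convex support such as the two moons can be reproduced.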