In a classification problem where the competing classes are not of comparable size, many popular classifiers exhibit a bias towards the larger classes, and the nearest neighbor classifier is no exception. To address this problem, in this article, we develop a statistical method for nearest neighbor classification on such imbalanced data sets. First, we construct a classifier for the binary classification problem and then extend it to classification problems involving more than two classes. Unlike existing oversampling methods, our proposed classifiers do not need to generate any pseudo observations, and hence the results are exactly reproducible. We establish the Bayes risk consistency of these classifiers under appropriate regularity conditions. Their superior performance over existing methods is amply demonstrated by analyzing several benchmark data sets.