To minimize the effect of outliers, kNN ensembles identify a set of observations closest to a new sample point and estimate its unknown class by majority voting over the labels of the training instances in that neighbourhood. Ordinary kNN-based procedures determine the k closest training observations in a neighbourhood region (enclosed by a sphere) using a distance formula. This procedure may fail when test points follow the pattern of nearest observations that lie along a path not contained within the given sphere of nearest neighbours. Furthermore, these methods combine hundreds of base kNN learners, many of which may have high classification errors, resulting in poor ensembles. To overcome these problems, an optimal extended neighbourhood rule based ensemble is proposed in which the neighbours are determined in k steps: the rule starts from the training point nearest to the unseen observation; the second point selected is the one closest to the previously selected point; this process continues until the required k observations are obtained. Each base model in the ensemble is constructed on a bootstrap sample in conjunction with a random subset of features. After building a sufficiently large number of base models, the optimal models are selected based on their performance on out-of-bag (OOB) data.
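The stepwise neighbour selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: Euclidean distance is assumed, and the function name `extended_neighbourhood_predict` is hypothetical.

```python
import numpy as np

def extended_neighbourhood_predict(X_train, y_train, x_new, k):
    """Extended neighbourhood rule (sketch): the first neighbour is the
    training point closest to x_new; each subsequent neighbour is the
    still-unselected training point closest to the previously selected
    one. The class is then decided by majority vote among the k labels."""
    remaining = list(range(len(X_train)))
    neighbours = []
    anchor = np.asarray(x_new, dtype=float)
    for _ in range(k):
        # distance from the current anchor to every remaining training point
        dists = [np.linalg.norm(X_train[i] - anchor) for i in remaining]
        nearest = remaining[int(np.argmin(dists))]
        neighbours.append(nearest)
        remaining.remove(nearest)
        anchor = X_train[nearest]  # next search continues along the path
    # majority vote over the labels of the selected neighbours
    labels, counts = np.unique(y_train[neighbours], return_counts=True)
    return labels[int(np.argmax(counts))]
```

In a full ensemble along the lines described, each base learner would apply this rule on a bootstrap sample restricted to a random feature subset, and only the learners with the lowest OOB error would be retained.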