In this paper, we propose an ensemble learning algorithm called \textit{under-bagging $k$-nearest neighbors} (\textit{under-bagging $k$-NN}) for imbalanced classification problems. On the theoretical side, by developing a new learning theory analysis, we show that with properly chosen parameters, i.e., the number of nearest neighbors $k$, the expected sub-sample size $s$, and the number of bagging rounds $B$, optimal convergence rates for under-bagging $k$-NN can be achieved under mild assumptions w.r.t.~the arithmetic mean (AM) of recalls. Moreover, we show that with a relatively small $B$, the expected sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k$ can be reduced simultaneously, especially when the data are highly imbalanced. This leads to substantially lower time complexity and roughly the same space complexity. On the practical side, we conduct numerical experiments that verify the theoretical results, demonstrating the benefits of the under-bagging technique through the promising AM performance and efficiency of our proposed algorithm.
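To make the procedure concrete, the following is a minimal sketch of an under-bagging $k$-NN classifier in Python, assuming scikit-learn's \texttt{KNeighborsClassifier} as the base learner. The function name \texttt{under\_bagging\_knn} and the acceptance-probability sampling scheme (each point of class $c$ kept with probability $\min(1, s/(M\,n_c))$, so that the expected sub-sample size is roughly $s$ with balanced classes) are illustrative assumptions, not the paper's exact routine.

\begin{verbatim}
# A minimal sketch of under-bagging k-NN. The sampling scheme below is one
# plausible reading of "expected sub-sample size s", assumed for illustration.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def under_bagging_knn(X, y, X_test, k=5, s=200, B=10, rng=None):
    rng = np.random.default_rng(rng)
    classes = np.unique(y)          # sorted class labels
    M = len(classes)
    votes = np.zeros((len(X_test), M))
    for _ in range(B):
        # Under-sample: keep each point of class c with probability
        # min(1, s / (M * n_c)), so every class contributes ~s/M points
        # in expectation; majority classes are thinned the most.
        keep = np.zeros(len(y), dtype=bool)
        for c in classes:
            idx = np.flatnonzero(y == c)
            p = min(1.0, s / (M * len(idx)))
            keep[idx] = rng.random(len(idx)) < p
        # Guard against tiny sub-samples so k never exceeds the sample size.
        k_eff = min(k, int(keep.sum()))
        clf = KNeighborsClassifier(n_neighbors=k_eff).fit(X[keep], y[keep])
        # Aggregate over rounds by summing class-posterior estimates,
        # aligning columns in case a class is absent from a sub-sample.
        proba = clf.predict_proba(X_test)
        for j, c in enumerate(clf.classes_):
            votes[:, np.searchsorted(classes, c)] += proba[:, j]
    return classes[np.argmax(votes, axis=1)]
\end{verbatim}

Because each of the $B$ base learners is fit on only about $s \ll n$ points, both the neighbor search at prediction time and the per-round training cost shrink, which is the source of the time-complexity savings discussed above.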