In this paper we propose a novel data-level algorithm for handling data imbalance in classification tasks, the Synthetic Majority Undersampling Technique (SMUTE). SMUTE applies to undersampling the idea of interpolating between nearby instances, previously introduced for oversampling in SMOTE. We further propose the Combined Synthetic Oversampling and Undersampling Technique (CSMOUTE), which integrates SMOTE oversampling with SMUTE undersampling. The results of the experimental study demonstrate the usefulness of both SMUTE and CSMOUTE, especially when combined with more complex classifiers, namely MLP and SVM, and when applied to datasets containing a large number of outliers. We conclude that the proposed approach shows promise for further extensions that accommodate local data characteristics, a direction discussed in more detail in the paper.
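The abstract does not spell out the SMUTE procedure, so the following is only a minimal sketch of what interpolation-based undersampling could look like, not the authors' implementation. It assumes that a randomly chosen majority instance and one of its k nearest majority neighbours are replaced by a single point interpolated between them, repeated until a target size is reached. The function name `smute_sketch` and the parameters `n_target`, `k`, and `seed` are hypothetical.

```python
import numpy as np


def smute_sketch(X_maj, n_target, k=3, seed=0):
    """Shrink the majority class to n_target rows by merging nearby pairs.

    Illustrative assumption only: the merge rule below is not taken
    from the paper.
    """
    if n_target < 1:
        raise ValueError("n_target must be at least 1")
    rng = np.random.default_rng(seed)
    X = np.asarray(X_maj, dtype=float).copy()
    while len(X) > n_target:
        # Pick a random majority instance.
        i = rng.integers(len(X))
        # Euclidean distances from that instance to every other one.
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf  # exclude the instance itself
        k_eff = min(k, len(X) - 1)
        neighbours = np.argsort(dist)[:k_eff]
        j = rng.choice(neighbours)
        # Interpolate at a random position on the segment between the pair.
        lam = rng.random()
        merged = X[i] + lam * (X[j] - X[i])
        # Replace the two originals with the single synthetic point.
        X = np.delete(X, [i, j], axis=0)
        X = np.vstack([X, merged])
    return X
```

Each iteration removes two instances and adds one, so the majority class shrinks by exactly one row per step. Under the same assumptions, a CSMOUTE-style pipeline would apply a SMOTE implementation to the minority class and a routine like this one to the majority class.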