Classification of imbalanced data is one of the common problems in the field of data mining. Imbalanced data substantially degrades the performance of standard classification models. Data-level approaches mainly address the problem with oversampling methods such as the synthetic minority oversampling technique (SMOTE). However, because methods such as SMOTE generate instances by linear interpolation, the synthetic data space may take a polygonal shape. In addition, oversampling methods can generate outliers of the minority class. In this paper, we propose the Gaussian-based minority oversampling technique (GMOTE), which takes a statistical perspective on imbalanced datasets. To avoid linear interpolation and to account for outliers, the proposed method generates instances with a Gaussian Mixture Model. Motivated by the clustering-based multivariate Gaussian outlier score (CMGOS), we adapt the tail probability of instances through the Mahalanobis distance to account for local outliers. Experiments were carried out on a representative set of benchmark datasets, and the performance of GMOTE was compared with that of other methods such as SMOTE. When GMOTE is combined with a classification and regression tree (CART) or a support vector machine (SVM), it achieves better accuracy and F1-score. The experimental results demonstrate its robust performance.
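The core idea above (sample synthetic minority instances from a fitted Gaussian Mixture Model, then reject candidates whose chi-squared tail probability of the squared Mahalanobis distance to their mixture component marks them as outliers) can be sketched as follows. This is a minimal illustration under assumed defaults, not the authors' reference implementation; the function name `gmote_sketch` and the parameters `n_components`, `alpha`, and the candidate batch size are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.mixture import GaussianMixture

def gmote_sketch(X_min, n_new, n_components=2, alpha=0.05, seed=0):
    """Sketch of a GMOTE-style oversampler (assumed interface).

    Fits a GMM to the minority class X_min, draws candidate synthetic
    points, and keeps only candidates whose chi-squared tail probability
    (of the squared Mahalanobis distance to their assigned component)
    exceeds `alpha`, i.e. points that are not local outliers.
    """
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X_min)
    d = X_min.shape[1]
    # Draw a generous candidate batch; with alpha = 0.05 roughly 95% survive.
    cand, labels = gmm.sample(4 * n_new)
    kept = []
    for x, k in zip(cand, labels):
        diff = x - gmm.means_[k]
        # Squared Mahalanobis distance to the assigned component.
        m2 = diff @ np.linalg.inv(gmm.covariances_[k]) @ diff
        # Under the component's Gaussian, m2 ~ chi-squared with d degrees
        # of freedom; a small tail probability flags a likely outlier.
        if 1.0 - chi2.cdf(m2, df=d) > alpha:
            kept.append(x)
    return np.asarray(kept[:n_new])
```

The rejection step is what distinguishes this from plain GMM sampling: interpolation-free instances are drawn from the fitted density, and the Mahalanobis-based tail probability discards candidates that would become minority-class outliers.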