Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results. ID arises when the number of samples belonging to one class outnumbers that of the other by a wide margin, biasing the model's learning process towards the majority class. In recent years, several solutions have been put forward to address this issue, which either synthetically generate new data for the minority class or reduce the number of majority-class samples to balance the data. In this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), combined with well-known imbalanced-data solutions, namely oversampling and undersampling. To evaluate our methods, we use the KEEL, breast cancer, and Z-Alizadeh Sani datasets. To obtain reliable results, we repeated each experiment 100 times with randomly shuffled data distributions. The classification results demonstrate that the combined Synthetic Minority Oversampling Technique (SMOTE)-Normalization-CNN approach outperforms the other methodologies, achieving 99.08% accuracy across the 24 imbalanced datasets. The proposed combined model can therefore be applied to imbalanced binary classification problems on other real datasets.
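The following is a minimal sketch of the SMOTE-Normalization-CNN pipeline summarized above, not the authors' actual implementation. The library choices (imbalanced-learn, scikit-learn, Keras), the 1-D CNN architecture, and all hyperparameters are illustrative assumptions rather than details taken from the paper.

```python
# Hedged sketch of an oversample -> normalize -> CNN pipeline for imbalanced
# binary classification on tabular data. Architecture and hyperparameters are
# placeholders, not the paper's configuration.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from tensorflow import keras

def build_cnn(n_features: int) -> keras.Model:
    # Simple 1-D CNN over the feature vector (illustrative architecture).
    model = keras.Sequential([
        keras.layers.Input(shape=(n_features, 1)),
        keras.layers.Conv1D(16, kernel_size=3, activation="relu", padding="same"),
        keras.layers.MaxPooling1D(2),
        keras.layers.Flatten(),
        keras.layers.Dense(32, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

def train_smote_norm_cnn(X: np.ndarray, y: np.ndarray, epochs: int = 30):
    # Stratified split so the imbalanced class ratio is preserved in the test set.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, shuffle=True
    )
    # 1) Oversample the minority class with SMOTE (training split only).
    X_res, y_res = SMOTE().fit_resample(X_train, y_train)
    # 2) Normalize features to [0, 1]; fit the scaler on the resampled training data.
    scaler = MinMaxScaler()
    X_res = scaler.fit_transform(X_res)
    X_test = scaler.transform(X_test)
    # 3) Train the CNN on the balanced, normalized data.
    model = build_cnn(X_res.shape[1])
    model.fit(X_res[..., np.newaxis], y_res, epochs=epochs, batch_size=32, verbose=0)
    # Evaluate on the untouched, still-imbalanced test split.
    _, acc = model.evaluate(X_test[..., np.newaxis], y_test, verbose=0)
    return model, acc
```

Resampling and scaling are applied only to the training split so that the test set remains representative of the original imbalanced distribution; repeating this procedure over many random shuffles mirrors the paper's 100-run evaluation protocol.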