Binary classification on imbalanced datasets is challenging: models tend to assign every sample to the majority class. Existing solutions such as sampling methods, cost-sensitive methods, and ensemble learning methods improve the poor accuracy on the minority class, but they are limited by overfitting or by cost parameters that are difficult to choose. We propose HADR, a hybrid approach with dimension reduction that consists of data block construction, dimensionality reduction, and ensemble learning with deep neural network classifiers. We evaluate its performance on eight public imbalanced datasets in terms of recall, G-mean, and AUC. The results show that our model outperforms state-of-the-art methods.
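To make the evaluation metrics concrete, the following is a minimal sketch of how recall, G-mean, and AUC can be computed for a binary problem. It is not the authors' evaluation code. It assumes the common definitions: G-mean as the square root of recall (sensitivity) times specificity, and AUC as the probability that a randomly chosen minority sample scores higher than a randomly chosen majority sample. The function names `recall_and_gmean` and `auc`, and the toy labels, are hypothetical.

```python
import math


def recall_and_gmean(y_true, y_pred):
    """Return (recall, G-mean) for binary labels, with 1 as the minority class.

    Assumes G-mean = sqrt(recall * specificity), a common definition.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return recall, math.sqrt(recall * specificity)


def auc(y_true, scores):
    """Return AUC via pairwise comparison of scores; ties count as half a win."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample from each class")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 8 majority samples (0) and 2 minority samples (1).
y_true = [0] * 8 + [1] * 2

# A model that predicts the majority class for every sample is 80% accurate,
# yet its recall and G-mean are both zero.
majority_only = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, majority_only)) / len(y_true)
print(accuracy, recall_and_gmean(y_true, majority_only))
```

The toy case illustrates why plain accuracy is not used here: a classifier that ignores the minority class entirely still looks accurate, while recall and G-mean expose the failure.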