Real-world datasets often exhibit varying degrees of imbalanced (i.e., long-tailed or skewed) class distributions. While the majority (a.k.a. head or frequent) classes have sufficient samples, the minority (a.k.a. tail or rare) classes can be under-represented by a rather limited number of samples. On one hand, data resampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which shrinks the feature space, is a conventional machine learning technique for building stronger classification models on a dataset. However, the possible synergy between feature selection (a form of dimension reduction) and data resampling for high-performance imbalance classification has rarely been investigated before. To address this issue, this paper carries out a comprehensive empirical study on the joint influence of feature selection and resampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification, i.e., applying feature selection before or after data resampling. We conduct a large number of experiments (9,225 in total) on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines, so both should be considered when deriving the best-performing model for imbalance classification. We also find that the performance of an imbalance classification model depends on the classifier adopted, the ratio between the number of majority and minority samples (IR), and the ratio between the number of samples and the number of features (SFR). Overall, this study should serve as a new reference for researchers and practitioners in imbalance learning.
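To make the two opposite pipelines concrete, below is a minimal sketch (not taken from the paper) of applying feature selection before versus after data resampling, assuming scikit-learn and imbalanced-learn are available. SMOTE, SelectKBest, and a decision tree are used here as illustrative stand-ins for the 6 resampling approaches, 9 feature selection methods, and 3 classifiers studied; the synthetic dataset and all parameter choices are assumptions for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # allows resamplers as pipeline steps

# Synthetic two-class imbalanced dataset (IR roughly 9:1); illustrative only.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pipeline A: feature selection first, then resampling (FS -> RS).
fs_then_rs = Pipeline([
    ("fs", SelectKBest(f_classif, k=10)),
    ("rs", SMOTE(random_state=0)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Pipeline B: resampling first, then feature selection (RS -> FS).
rs_then_fs = Pipeline([
    ("rs", SMOTE(random_state=0)),
    ("fs", SelectKBest(f_classif, k=10)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

for name, pipe in [("FS->RS", fs_then_rs), ("RS->FS", rs_then_fs)]:
    pipe.fit(X_tr, y_tr)
    score = balanced_accuracy_score(y_te, pipe.predict(X_te))
    print(f"{name}: balanced accuracy = {score:.3f}")
```

Note that imbalanced-learn's Pipeline applies the SMOTE step only during fitting, so the test set is never contaminated with synthetic samples; balanced accuracy is used here because plain accuracy can be misleading under class imbalance.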