Real-world datasets often exhibit varying degrees of imbalanced (i.e., long-tailed or skewed) class distributions. While the majority (a.k.a. head or frequent) classes have sufficient samples, the minority (a.k.a. tail or rare) classes can be under-represented by a rather limited number of samples. On one hand, data resampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which shrinks the feature space, is a conventional machine learning technique for building stronger classification models on a dataset. However, the possible synergy between feature selection (a form of dimension reduction) and data resampling for high-performance imbalance classification has rarely been investigated before. To address this issue, this paper carries out a comprehensive empirical study on the joint influence of feature selection and resampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification, i.e., applying feature selection before or after data resampling. We conduct a large number of experiments (9,225 in total) on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines, so both should be considered when deriving the best-performing model for imbalance classification. We also find that the performance of an imbalance classification model depends on the classifier adopted, the ratio between the number of majority and minority samples (IR), and the ratio between the number of samples and the number of features (SFR). Overall, this study should serve as a new reference for researchers and practitioners in imbalance learning.
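To make the two opposite pipelines concrete, below is a minimal sketch (not taken from the paper) of applying feature selection before versus after data resampling, assuming scikit-learn and imbalanced-learn are available. SMOTE, SelectKBest, and a decision tree are used here as illustrative stand-ins for the 6 resampling approaches, 9 feature selection methods, and 3 classifiers studied; the synthetic dataset and all parameter choices are assumptions for demonstration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline  # allows resamplers as pipeline steps

# Synthetic two-class imbalanced dataset (IR roughly 9:1); illustrative only.
X, y = make_classification(n_samples=1000, n_features=20,
                           weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Pipeline A: feature selection first, then resampling (FS -> RS).
fs_then_rs = Pipeline([
    ("fs", SelectKBest(f_classif, k=10)),
    ("rs", SMOTE(random_state=0)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

# Pipeline B: resampling first, then feature selection (RS -> FS).
rs_then_fs = Pipeline([
    ("rs", SMOTE(random_state=0)),
    ("fs", SelectKBest(f_classif, k=10)),
    ("clf", DecisionTreeClassifier(random_state=0)),
])

for name, pipe in [("FS->RS", fs_then_rs), ("RS->FS", rs_then_fs)]:
    pipe.fit(X_tr, y_tr)
    score = balanced_accuracy_score(y_te, pipe.predict(X_te))
    print(f"{name}: balanced accuracy = {score:.3f}")
```

Note that imbalanced-learn's Pipeline applies the SMOTE step only during fitting, so the test set is never contaminated with synthetic samples; balanced accuracy is used here because plain accuracy can be misleading under class imbalance.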