Classification datasets with skewed class proportions are called imbalanced. Class imbalance is a problem because most machine learning classification algorithms assume that all classes are equally represented in the training dataset. To counter this problem, many algorithm-level and data-level approaches have been developed, mainly ensemble learning and data augmentation techniques. This paper presents a new way to counter class imbalance through a data-splitting strategy called balanced split. Data splitting can play an important role in correctly classifying imbalanced datasets. We show that commonly used data-splitting strategies have some disadvantages, and that our proposed balanced split addresses them.
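
To make the role of data splitting concrete, the following minimal sketch contrasts two commonly used strategies, a plain random split and a stratified split, on a toy label list. It is not the balanced split proposed in this paper, whose details are not given in this abstract. The function names `random_split`, `stratified_split`, and `class_counts`, the 190/10 class ratio, and the 20% test fraction are all assumptions chosen for illustration.

```python
import random
from collections import Counter, defaultdict


def random_split(labels, test_fraction, seed=0):
    """Shuffle all indices together and take the first test_fraction as the test set."""
    rng = random.Random(seed)
    indices = list(range(len(labels)))
    rng.shuffle(indices)
    n_test = round(len(indices) * test_fraction)
    return indices[n_test:], indices[:n_test]  # (train, test)


def stratified_split(labels, test_fraction, seed=0):
    """Split each class separately so both parts keep the original class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, test = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def class_counts(labels, indices):
    """Count how many samples of each class the given indices select."""
    return Counter(labels[i] for i in indices)


# Hypothetical toy data: 190 majority samples (class 0), 10 minority samples (class 1).
labels = [0] * 190 + [1] * 10

rand_train, rand_test = random_split(labels, test_fraction=0.2)
strat_train, strat_test = stratified_split(labels, test_fraction=0.2)

# The random split gives no guarantee about how many minority samples reach the test set;
# the stratified split always sends 20% of each class there.
print("random test counts:    ", dict(class_counts(labels, rand_test)))
print("stratified test counts:", dict(class_counts(labels, strat_test)))
```

With a random split, the number of minority samples in the test set varies with the seed and can be very small, whereas the stratified split fixes it by construction. Stratification preserves the skewed proportions in both parts and does not by itself remove the imbalance.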