Learning from class-imbalanced datasets poses challenges for many machine learning algorithms. Many real-world domains are, by definition, class imbalanced, having a majority class with naturally many more instances than the minority class (e.g., genuine bank transactions occur much more often than fraudulent ones). Many methods have been proposed to solve the class imbalance problem, among the most popular being oversampling techniques such as SMOTE. These methods generate synthetic instances in the minority class to balance the dataset, performing data augmentations that improve the performance of predictive machine learning (ML) models. In this paper we advance a novel data augmentation method, adapted from eXplainable AI, that generates synthetic, counterfactual instances in the minority class. Unlike other oversampling techniques, this method adaptively combines existing instances from the dataset, using actual feature values rather than interpolating values between instances. Several experiments using four different classifiers and 25 datasets are reported, showing that this Counterfactual Augmentation method (CFA) generates useful synthetic data points in the minority class. The experiments also show that CFA is competitive with many other oversampling methods, many of which are variants of SMOTE. The basis for CFA's performance is discussed, along with the conditions under which it is likely to perform better or worse in future tests.
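The contrast drawn above, interpolating between instances versus recombining actual feature values, can be illustrated with a minimal sketch. This is not the paper's exact CFA algorithm (which selects instances adaptively); the function names and the fixed feature mask are illustrative assumptions, shown only to make the distinction concrete:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_style_point(a, b, rng):
    """SMOTE-style synthesis: interpolate between two minority-class
    instances, producing feature values that lie *between* the originals
    and may not occur anywhere in the dataset."""
    lam = rng.random()  # random interpolation weight in [0, 1)
    return a + lam * (b - a)

def cfa_style_point(a, b, mask):
    """Counterfactual-style synthesis (illustrative sketch only): copy
    actual feature values from a second instance into the first, so every
    feature value in the synthetic point already occurs in the data."""
    out = a.copy()
    out[mask] = b[mask]
    return out

# Two minority-class instances with three numeric features (toy data).
a = np.array([1.0, 5.0, 2.0])
b = np.array([3.0, 1.0, 4.0])

interpolated = smote_style_point(a, b, rng)
recombined = cfa_style_point(a, b, mask=np.array([False, True, True]))
```

Here `interpolated` lies on the line segment between `a` and `b`, whereas `recombined` is `[1.0, 1.0, 4.0]`: each of its feature values is taken verbatim from one of the two source instances.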