For the last two decades, oversampling has been used to address the challenge of learning from imbalanced datasets, and many oversampling approaches have been proposed in the literature. Oversampling itself, however, is a cause for concern: models trained on synthetic data may fail badly when applied to real-world problems. The fundamental difficulty is that, given a real-life population, the synthesized samples may not truly belong to the minority class. As a result, training a classifier on these samples while treating them as minority examples may lead to incorrect predictions when the model is used in the real world. In this paper, we analyze a large number of oversampling methods and devise a new oversampling evaluation system based on hiding a number of majority examples and comparing them to the examples generated by the oversampling process. Using this evaluation system, we rank the methods according to their incorrectly generated examples. Our experiments with more than 70 oversampling methods and three imbalanced real-world datasets reveal that all of the oversampling methods studied generate minority samples that are most likely to be majority. Given the data and methods at hand, we argue that oversampling in its current forms and methodologies is unreliable for learning from class-imbalanced data and should be avoided in real-world applications.
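The evaluation idea described above (hide some majority examples, oversample, then compare the generated samples to the hidden ones) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy Gaussian data, the SMOTE-like interpolation used as a stand-in oversampler, the nearest-neighbour comparison rule, and the helper names `interpolate_oversample` and `hidden_majority_error`.

```python
import numpy as np


def interpolate_oversample(X_min, n_new, rng):
    """Hypothetical stand-in oversampler (SMOTE-like): each synthetic point
    lies on the segment between two randomly chosen minority points."""
    i = rng.integers(0, len(X_min), size=n_new)
    j = rng.integers(0, len(X_min), size=n_new)
    t = rng.random((n_new, 1))
    return X_min[i] + t * (X_min[j] - X_min[i])


def hidden_majority_error(X_syn, X_min, X_hidden):
    """Fraction of synthetic points whose nearest real example is a hidden
    majority example rather than a minority example (assumed criterion)."""
    d_min = np.linalg.norm(X_syn[:, None, :] - X_min[None, :, :], axis=2).min(axis=1)
    d_hid = np.linalg.norm(X_syn[:, None, :] - X_hidden[None, :, :], axis=2).min(axis=1)
    return float(np.mean(d_hid < d_min))


rng = np.random.default_rng(0)

# Toy imbalanced data: two overlapping Gaussian classes (illustrative only).
X_maj = rng.normal(0.0, 1.0, size=(500, 2))
X_min = rng.normal(1.5, 1.0, size=(25, 2))

# Hide a subset of the majority class; only the visible part would be
# available to an oversampler or classifier.
hide = rng.permutation(len(X_maj))[:100]
X_hidden = X_maj[hide]
X_visible = np.delete(X_maj, hide, axis=0)

# Generate enough synthetic minority points to balance the visible data.
# This simple oversampler ignores the majority class entirely.
X_syn = interpolate_oversample(X_min, len(X_visible) - len(X_min), rng)

error_rate = hidden_majority_error(X_syn, X_min, X_hidden)
print(f"share of synthetic samples closest to a hidden majority example: {error_rate:.2f}")
```

Under these assumptions, a higher share means more synthetic "minority" samples fall where unseen majority examples actually live; computing the same score for several oversamplers would give a ranking in the spirit of the one described.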