A key challenge in machine learning is class imbalance, where some classes (the majority classes) have far more samples than others (the minority classes). A classifier trained directly on imbalanced data is more likely to assign a new sample to one of the majority classes. In the extreme case, the classifier may ignore the minority classes entirely. This can have serious sociological implications in healthcare, where the minority classes are usually the disease classes (e.g., death or a positive clinical test result). In this paper, we introduce a software tool that uses Generative Adversarial Networks to oversample the minority classes so as to improve downstream classification. To the best of our knowledge, this is the first such tool that supports multi-class classification (where the target can have an arbitrary number of classes). The code is publicly available in our GitHub repository (https://github.com/yuxiaohuang/research/tree/master/gwu/working/cigan/code).