Deep learning models memorize training data, which hurts their ability to generalize to under-represented classes. We empirically study a convolutional neural network's internal representation of imbalanced image data and measure the generalization gap between the model's feature embeddings on the training and test sets, showing that the gap is wider for minority classes. This insight enables us to design an efficient three-phase CNN training framework for imbalanced data. The framework involves training the network end-to-end on the imbalanced data to learn accurate feature embeddings, performing data augmentation in the learned embedded space to balance the training distribution, and fine-tuning the classifier head on the embedded, balanced training data. We propose Expansive Over-Sampling (EOS) as a data augmentation technique to use within this framework. EOS forms synthetic training instances as convex combinations between minority-class samples and their nearest enemies in the embedded space, thereby reducing the generalization gap. The proposed framework improves accuracy over leading cost-sensitive and resampling methods commonly used in imbalanced learning. Moreover, it is more computationally efficient than standard data-level pre-processing methods, such as SMOTE and GAN-based oversampling, as it requires fewer parameters and less training time.
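Since the abstract only sketches EOS, the following is a minimal, hypothetical NumPy sketch of the core idea as described above: for each minority-class embedding, find its nearest enemy (the closest embedding with a different label) and synthesize new points on the segment between them. The function name, the uniform sampling of the interpolation coefficient, and the brute-force nearest-enemy search are our assumptions, not the paper's exact procedure.

```python
import numpy as np

def expansive_oversampling(embeddings, labels, rng=None):
    """Balance an embedded training set with EOS-style convex combinations.

    Hypothetical, simplified sketch: each under-represented class is grown
    to the size of the largest class by interpolating minority embeddings
    toward their nearest enemies.
    """
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    new_X, new_y = [], []
    for cls, count in zip(classes, counts):
        deficit = target - count
        if deficit == 0:
            continue
        own = embeddings[labels == cls]
        enemies = embeddings[labels != cls]
        # Nearest enemy of each minority sample (brute-force distances;
        # a KD-tree or approximate search would scale better).
        dists = np.linalg.norm(own[:, None, :] - enemies[None, :, :], axis=-1)
        nearest = enemies[dists.argmin(axis=1)]
        # Draw minority samples with replacement and interpolate toward
        # their nearest enemies: x_new = x_min + lam * (x_enemy - x_min).
        idx = rng.integers(0, len(own), size=deficit)
        lam = rng.uniform(0.0, 1.0, size=(deficit, 1))  # assumed range
        new_X.append(own[idx] + lam * (nearest[idx] - own[idx]))
        new_y.append(np.full(deficit, cls))
    if new_X:
        embeddings = np.vstack([embeddings, *new_X])
        labels = np.concatenate([labels, *new_y])
    return embeddings, labels
```

In the three-phase framework described above, a step like this would run between phases: the frozen backbone embeds the imbalanced training set, the embedded set is balanced, and the classifier head is then fine-tuned on the result.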