Class imbalance, also known as a long-tailed distribution, is a common problem in machine-learning-based classification tasks. When it occurs, the minority classes are overwhelmed by the majority, which poses a considerable challenge for data science. To address the class imbalance problem, researchers have proposed many methods: some balance the data set (e.g., SMOTE), others refine the loss function (e.g., Focal Loss), and some have noted that the value of labels influences class-imbalanced learning (Yang and Xu, "Rethinking the value of labels for improving class-imbalanced learning," NeurIPS 2020), but no one has yet changed the way data labels are encoded. Today, the most prevalent label-encoding technique is one-hot encoding, owing to its good performance in general settings. However, it is a poor choice for imbalanced data, because the classifier then treats majority and minority samples equally. In this paper, we propose the novel enhancement encoding technique, which is specifically designed for imbalanced classification. Enhancement encoding combines re-weighting and cost-sensitivity, so it can reflect the difference between hard and easy (or minority and majority) classes. To reduce the number of validation samples and the computational cost, we also replace the confusion matrix with a novel soft confusion matrix, which works better with a small validation set. In our experiments, we evaluate enhancement encoding with three different types of loss, and the results show that it is very effective at improving the performance of networks trained on imbalanced data; the improvement on minority classes is especially pronounced.
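To make the contrast concrete, the sketch below shows standard one-hot encoding, which assigns every class an identical unit-vector target, alongside a purely illustrative inverse-frequency re-weighting of those targets. The re-weighted variant is a hypothetical simplification for exposition, not the paper's enhancement encoding; function names and the weighting scheme are assumptions.

```python
import numpy as np

def one_hot(labels, num_classes):
    # Standard one-hot encoding: each target is a unit vector, so the
    # loss treats majority and minority samples identically.
    enc = np.zeros((len(labels), num_classes))
    enc[np.arange(len(labels)), labels] = 1.0
    return enc

def reweighted_targets(labels, num_classes):
    # Hypothetical illustration (NOT the paper's enhancement encoding):
    # scale each one-hot target by the inverse class frequency, so that
    # minority-class samples produce larger targets and thus larger
    # gradients during training.
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    weights = counts.sum() / (num_classes * counts)
    return one_hot(labels, num_classes) * weights[labels][:, None]

# Class 0 is the majority (3 samples), class 1 the minority (1 sample).
labels = np.array([0, 0, 0, 1])
targets = reweighted_targets(labels, 2)
```

Here the minority-class target ends up three times larger than the majority-class one, which is the intuition behind re-weighting; the paper's method additionally incorporates cost-sensitivity derived from validation performance.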