Unlike the case of training with a balanced dataset, the per-class recall (i.e., per-class accuracy) of neural networks trained on an imbalanced dataset is known to vary widely from category to category. The convention in long-tailed recognition is to manually split all categories into three subsets and to report the average accuracy within each subset. We argue that under such an evaluation setting, some categories are inevitably sacrificed. On one hand, focusing on the average accuracy over a balanced test set incurs little penalty even if some worst-performing categories have zero accuracy. On the other hand, classes in the "Few" subset do not necessarily perform worse than those in the "Many" or "Medium" subsets. We therefore advocate focusing more on improving the lowest recall among all categories and the harmonic mean of all recall values. Specifically, we propose a simple plug-in method that is applicable to a wide range of existing methods. By re-training the classifier of an existing pre-trained model with our proposed loss function, and optionally using an ensemble trick that combines the predictions of the two classifiers, we obtain a more uniform distribution of recall values across categories, which leads to a higher harmonic mean accuracy while keeping the (arithmetic) average accuracy high. The effectiveness of our method is verified on widely used benchmark datasets.
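To make the evaluation criteria above concrete, the following is a minimal sketch (not code from the paper; the function name `recall_metrics` and the epsilon guard are our own illustrative choices) of how per-class recall, the lowest recall, and the harmonic versus arithmetic mean accuracy could be computed from predictions on a balanced test set. Note that the harmonic mean is dominated by the worst-performing classes, which is why we use it to expose sacrificed categories.

```python
import numpy as np

def recall_metrics(y_true, y_pred, num_classes, eps=1e-12):
    """Per-class recall, lowest recall, harmonic mean, and arithmetic mean.

    y_true, y_pred: 1-D integer arrays of ground-truth and predicted labels.
    eps: illustrative guard so the harmonic mean stays defined when a class
         has zero recall (in which case it is effectively zero).
    """
    recalls = np.zeros(num_classes)
    for c in range(num_classes):
        mask = (y_true == c)
        # Recall of class c = correctly predicted samples / samples of class c.
        recalls[c] = (y_pred[mask] == c).mean() if mask.any() else 0.0
    lowest = recalls.min()
    # Harmonic mean: C / sum_c (1 / r_c); collapses toward zero if any r_c is near zero.
    hmean = num_classes / np.sum(1.0 / np.maximum(recalls, eps))
    amean = recalls.mean()  # the conventional balanced-test-set average accuracy
    return recalls, lowest, hmean, amean
```

A model can thus achieve a high arithmetic mean while the harmonic mean and lowest recall remain low, which is exactly the failure mode our method targets.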