ImageNet has arguably been the most popular image classification benchmark, but it is also one with a significant level of label noise. Recent studies have shown that many samples contain multiple classes, even though ImageNet is assumed to be a single-label benchmark. They have thus proposed to turn ImageNet evaluation into a multi-label task, with exhaustive multi-label annotations per image. However, they have not fixed the training set, presumably because of the formidable annotation cost. We argue that the mismatch between single-label annotations and effectively multi-label images is equally, if not more, problematic in the training setup, where random crops are applied. With the single-label annotations, a random crop of an image may contain an entirely different object from the ground truth, introducing noisy or even incorrect supervision during training. We thus re-label the ImageNet training set with multi-labels. We address the annotation cost barrier by letting a strong image classifier, trained on an extra source of data, generate the multi-labels. We utilize the pixel-wise multi-label predictions before the final pooling layer, in order to exploit the additional location-specific supervision signals. Training on the re-labeled samples results in improved model performance across the board. ResNet-50 attains a top-1 classification accuracy of 78.9% on ImageNet with our localized multi-labels, which can be further boosted to 80.2% with the CutMix regularization. We show that the models trained with localized multi-labels also outperform the baselines on transfer learning to object detection and instance segmentation tasks, and on various robustness benchmarks. The re-labeled ImageNet training set, pre-trained weights, and the source code are available at {https://github.com/naver-ai/relabel_imagenet}.
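The sketch below illustrates how the localized multi-label supervision described above could be consumed at training time: a precomputed pixel-wise label map (the classifier's predictions before its final pooling layer) is pooled over the region of each random crop to yield a soft multi-label target, and the model is trained with a soft-target cross-entropy. This is a minimal sketch under stated assumptions, not the released implementation; the use of RoIAlign for the region pooling, the function names, and the plain softmax normalization are illustrative choices. Refer to the linked repository for the authors' actual code.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import roi_align


def crop_soft_target(label_map: torch.Tensor, crop_box: tuple) -> torch.Tensor:
    """Pool a precomputed pixel-wise label map over a random-crop region
    to obtain a soft multi-label target for that crop.

    label_map: (C, H, W) class-score map from the strong classifier,
               taken before its final global pooling layer.
    crop_box:  (x1, y1, x2, y2) crop coordinates in label-map pixels.
               (Illustrative interface; the real pipeline may differ.)
    Returns:   (C,) soft target distribution for the crop.
    """
    feats = label_map.unsqueeze(0)  # (1, C, H, W), batch dim for roi_align
    # roi_align boxes are (batch_index, x1, y1, x2, y2)
    boxes = torch.tensor([[0.0, *crop_box]], dtype=feats.dtype, device=feats.device)
    pooled = roi_align(feats, boxes, output_size=1, aligned=True)  # (1, C, 1, 1)
    return F.softmax(pooled.flatten(), dim=0)


def soft_cross_entropy(logits: torch.Tensor, soft_targets: torch.Tensor) -> torch.Tensor:
    """Cross-entropy loss against soft (multi-label) targets.

    logits:       (N, C) model outputs for a batch of crops.
    soft_targets: (N, C) crop-level soft labels from crop_soft_target.
    """
    return -(soft_targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()


# Usage sketch: one crop from one image's stored label map.
if __name__ == "__main__":
    num_classes, h, w = 1000, 15, 15
    label_map = torch.randn(num_classes, h, w)        # stands in for a saved label map
    target = crop_soft_target(label_map, (2.0, 3.0, 10.0, 12.0))
    logits = torch.randn(1, num_classes)
    loss = soft_cross_entropy(logits, target.unsqueeze(0))
    print(loss.item())
```

Pooling the label map per crop is what ties the supervision to the location actually seen by the network, in contrast to reusing the single image-level label regardless of where the random crop lands.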