Ghost Loss to Question the Reliability of Training Data

From arXiv. This is the authors' preprint version of a paper published in IEEE Access in 2020. Please cite it as follows: A. Deliège, A. Cioppa and M. Van Droogenbroeck, "Ghost Loss to Question the Reliability of Training Data", in IEEE Access, vol. 8, pp. 44774-44782, 2020, doi: 10.1109/ACCESS.2020.2978283

Supervised image classification relies on training data that are assumed to be correctly annotated; this assumption underpins most work in deep learning. Consequently, during training, a network is forced to match the label provided by the annotator and has no flexibility to choose an alternative when it detects an inconsistency. Erroneously labeled training images may therefore end up "correctly" classified in classes to which they do not actually belong. This may reduce the performance of the network and thus encourage building more complex networks without even checking the quality of the training data. In this work, we question the reliability of annotated datasets. For that purpose, we introduce the notion of ghost loss, which can be seen as a regular loss that is zeroed out for some predicted values in a deterministic way and that allows the network to choose an alternative to the given label without being penalized. After a proof-of-concept experiment, we use the ghost loss principle to detect confusing images and erroneously labeled images in well-known training datasets (MNIST, Fashion-MNIST, SVHN, CIFAR10), and we provide a new tool, called the sanity matrix, for summarizing these confusions.
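
The ghost loss idea described above can be illustrated with a minimal sketch. The abstract does not specify the base loss, how the unpenalized prediction is selected, or how the sanity matrix is built, so everything below is an assumption made for illustration: a per-class squared error against a one-hot target, a "ghost" chosen as the highest-scoring class other than the given label, and a sanity matrix that counts (label, ghost) pairs whose ghost score exceeds a hypothetical `threshold`. The names `ghost_loss` and `sanity_matrix` are likewise hypothetical helpers, not the authors' code.

```python
from typing import List, Sequence, Tuple


def ghost_loss(scores: Sequence[float], label: int) -> Tuple[float, int]:
    """Squared-error loss in which one 'ghost' class is left unpenalized.

    Assumptions (not stated in the abstract): scores lie in [0, 1], the
    target is one-hot, and the ghost is the highest-scoring class other
    than the given label.
    """
    # Deterministic choice of the ghost: best-scoring class that is not the label.
    ghost = max((k for k in range(len(scores)) if k != label),
                key=lambda k: scores[k])
    total = 0.0
    for k, s in enumerate(scores):
        if k == ghost:
            continue  # this loss term is zeroed out: the alternative is free
        target = 1.0 if k == label else 0.0
        total += (s - target) ** 2
    return total, ghost


def sanity_matrix(samples: Sequence[Tuple[Sequence[float], int]],
                  num_classes: int,
                  threshold: float = 0.5) -> List[List[int]]:
    """Count, per given label (row), which ghost class (column) is confident.

    The threshold is a hypothetical parameter used here to keep only
    ghosts that the network scores highly.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for scores, label in samples:
        _, ghost = ghost_loss(scores, label)
        if scores[ghost] >= threshold:
            matrix[label][ghost] += 1
    return matrix
```

Under these assumptions, an image whose ghost class receives a high score even though it was never rewarded for it is a candidate for a confusing or mislabeled example, and the off-diagonal entries of the matrix summarize which label pairs are most often involved.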
