High-quality data is necessary for modern machine learning. However, acquiring such data is difficult because human annotations are noisy and ambiguous. Aggregating such annotations into a single label per image lowers data quality. We propose a data-centric image classification benchmark with ten real-world datasets and multiple annotations per image so that researchers can investigate and quantify the impact of such data quality issues. With the benchmark, we study the impact of annotation costs and (semi-)supervised methods on data quality for image classification by applying a novel methodology to a range of algorithms and diverse datasets. The benchmark takes a two-phase approach: a data label improvement method in the first phase and a fixed evaluation model in the second phase. This yields a measure of the relation between the input labeling effort and the performance of (semi-)supervised algorithms, giving deeper insight into how labels should be created for effective model training. Across thousands of experiments, we show that one annotation is not enough and that including multiple annotations allows a better approximation of the real underlying class distribution. We find that hard labels cannot capture the ambiguity of the data, which might lead to the common issue of overconfident models. Based on the presented datasets, benchmarked methods, and analysis, we open up multiple research opportunities for future work on label noise estimation approaches, data annotation schemes, realistic (semi-)supervised learning, and more reliable image collection.
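To make the contrast between a hard label and an approximated class distribution concrete, the following is a minimal sketch of how multiple annotations of one image could be turned into either form. It is an illustration under stated assumptions, not the benchmark's implementation: the function names `soft_label` and `hard_label`, the integer class encoding, and the example annotations are all hypothetical.

```python
from collections import Counter


def soft_label(annotations, num_classes):
    """Empirical class distribution over the annotations of one image.

    `annotations` is a list of integer class indices, one per annotator
    (an assumed encoding, chosen only for this illustration).
    """
    if not annotations:
        raise ValueError("at least one annotation is required")
    counts = Counter(annotations)
    total = len(annotations)
    return [counts.get(c, 0) / total for c in range(num_classes)]


def hard_label(annotations, num_classes):
    """Majority-vote class; ties go to the lowest class index."""
    dist = soft_label(annotations, num_classes)
    return max(range(num_classes), key=lambda c: dist[c])


# Hypothetical image with five annotations over three classes.
annotations = [0, 0, 1, 0, 1]
print(soft_label(annotations, 3))  # [0.6, 0.4, 0.0]
print(hard_label(annotations, 3))  # 0
```

In this example the hard label keeps only class 0 and discards the fact that two of five annotators chose class 1, while the soft label retains that disagreement. With a single annotation per image, the two forms coincide and the ambiguity cannot be observed at all.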