Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have an often-overlooked confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates a data-driven approach to prioritising samples for re-annotation, which we term "active label cleaning". We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a new medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed active label cleaning enables correcting labels up to 4 times more effectively than typical random selection in realistic conditions, making better use of experts' valuable time to improve dataset quality.
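The ranking idea can be illustrated with a minimal sketch. This is not the paper's exact scoring rule, but an assumed instantiation: label correctness is estimated as the model's posterior probability of the current (possibly noisy) label, and labelling difficulty as the predictive entropy; samples with low estimated correctness are prioritised, with ties broken towards easier samples. The function name and tie-breaking scheme are illustrative assumptions.

```python
import numpy as np

def rank_for_relabelling(posteriors, noisy_labels):
    """Rank samples for re-annotation: likely-mislabelled first,
    and, among equally suspect samples, easier ones first.

    posteriors:   (N, C) model-predicted class probabilities.
    noisy_labels: (N,) current, possibly noisy, integer labels.
    Returns sample indices ordered by relabelling priority.
    """
    idx = np.arange(len(noisy_labels))
    # Estimated label correctness: posterior mass on the current label.
    correctness = posteriors[idx, noisy_labels]
    # Labelling difficulty: predictive entropy (higher = more ambiguous).
    eps = 1e-12
    difficulty = -np.sum(posteriors * np.log(posteriors + eps), axis=1)
    # Primary key: low correctness; secondary key: low difficulty.
    return np.lexsort((difficulty, correctness))
```

Under this sketch, a sample whose current label receives little posterior mass is surfaced to the annotator before confidently correct ones, which is what lets a fixed relabelling budget fix more errors than random selection.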