A recurring focus of the deep learning community is reducing labeling effort. Gathering and annotating data with a search engine is a simple alternative to building a dataset that is fully gathered and annotated by humans. Although web crawling is very time efficient, some of the retrieved images are unavoidably noisy, i.e., incorrectly labeled. Designing robust algorithms for training on noisy data gathered from the web is an important research direction that would make building datasets easier. In this paper, we conduct a study to understand the type of label noise to expect when building a dataset using a search engine. We review the current limitations of state-of-the-art methods for dealing with noisy labels in image classification tasks under a web noise distribution. We propose a simple solution to bridge the gap with a fully clean dataset, Dynamic Softening of Out-of-distribution Samples (DSOS), which we design on corrupted versions of the CIFAR-100 dataset. We compare DSOS against state-of-the-art algorithms on the web-noise-perturbed MiniImageNet and Stanford datasets and on real label noise datasets: WebVision 1.0 and Clothing1M. Our work is fully reproducible: https://git.io/JKGcj