Using search engines for web image retrieval is a tempting alternative to manual curation when creating an image dataset, but the main drawback remains the proportion of incorrect (noisy) samples retrieved. Previous work has shown these noisy samples to be a mixture of in-distribution (ID) samples, which are assigned to the incorrect category but share visual semantics with other classes in the dataset, and out-of-distribution (OOD) images, which share no semantic correlation with any category in the dataset. The latter are, in practice, the dominant type of noisy image retrieved. To tackle this noise duality, we propose a two-stage algorithm. It starts with a detection step in which we use unsupervised contrastive feature learning to represent images in a feature space. We find that the alignment and uniformity principles of contrastive learning allow OOD samples to be linearly separated from ID samples on the unit hypersphere. We then spectrally embed the unsupervised representations using a fixed neighborhood size and apply an outlier-sensitive clustering at the class level to detect the clean and OOD clusters as well as ID noisy outliers. Finally, we train a noise-robust neural network that corrects ID noise to the correct category and uses OOD samples in a guided contrastive objective, clustering them to improve low-level features. Our algorithm improves on state-of-the-art results on synthetic-noise image datasets as well as on real-world web-crawled data. Our work is fully reproducible [github].
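The class-level detection step can be illustrated with a minimal sketch. This is not the authors' implementation. It rests on several assumptions: the spectral embedding is skipped entirely, a plain DBSCAN-style density clustering is swapped in as a stand-in for the "outlier-sensitive clustering", the largest cluster in a class is assumed to hold the clean samples, and the names `split_class`, `eps`, and `min_pts` are hypothetical. The toy 2-D features stand in for contrastive representations.

```python
import math


def l2_normalize(v):
    """Project a non-zero feature vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN with Euclidean distance (stand-in clustering, an assumption).

    Returns one label per point: 0, 1, ... for clusters and -1 for outliers.
    """
    n = len(points)
    # Each point counts as its own neighbour, as in the usual DBSCAN definition.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional outlier, may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # expand only through core points
    return labels


def split_class(features, eps=0.3, min_pts=4):
    """Tag every sample of ONE class as 'clean', 'ood', or 'id_noise'.

    Assumption (hypothetical heuristic): the largest cluster is the clean one,
    the remaining clusters are OOD, and unclustered outliers are ID noise.
    """
    points = [l2_normalize(f) for f in features]
    labels = dbscan(points, eps, min_pts)
    sizes = {}
    for lab in labels:
        if lab != -1:
            sizes[lab] = sizes.get(lab, 0) + 1
    clean = max(sizes, key=sizes.get) if sizes else None
    roles = []
    for lab in labels:
        if lab == -1:
            roles.append("id_noise")
        elif lab == clean:
            roles.append("clean")
        else:
            roles.append("ood")
    return roles


# Toy class: 20 clean samples, 10 OOD samples, 2 isolated ID-noise samples,
# placed by angle on a circle of radius 2 (normalisation maps them to radius 1).
angles = (
    [0.01 * k for k in range(20)]
    + [math.pi / 2 + 0.01 * k for k in range(10)]
    + [math.pi, 4.5]
)
toy_features = [[2.0 * math.cos(a), 2.0 * math.sin(a)] for a in angles]
roles = split_class(toy_features)
print({r: roles.count(r) for r in ("clean", "ood", "id_noise")})
```

In this sketch the density threshold does the work the abstract assigns to the outlier-sensitive clustering: dense groups become clusters, and samples that belong to no dense group are left over as candidate ID noise. A faithful reproduction would cluster the spectral embedding rather than the raw normalised features.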