A long-standing issue with deep learning is the need for large and consistently labeled datasets. Although current research in semi-supervised learning can reduce the required amount of annotated data by a factor of 10 or more, this line of research still assumes distinct classes such as cats and dogs. In the real world, however, we often encounter problems where different experts have different opinions, which produces fuzzy labels. We propose a novel framework for semi-supervised classification of such fuzzy labels. Our framework is based on the idea of overclustering to detect substructures in these fuzzy labels. We propose a novel loss to improve the overclustering capability of our framework and show on the common image classification dataset STL-10 that it is faster and has better overclustering performance than previous work. On a real-world plankton dataset, we illustrate the benefit of overclustering for fuzzy labels and show that we beat previous state-of-the-art semi-supervised methods. Moreover, we acquire 5 to 10% more consistent predictions of substructures.