Semi-supervised algorithms aim to learn prediction functions from a small set of labeled observations and a large set of unlabeled observations. Because this framework is relevant in many applications, it has received considerable interest in both academia and industry. Among existing techniques, self-training methods have undoubtedly attracted the most attention in recent years. These models are designed to find the decision boundary in low-density regions without making additional assumptions about the data distribution, and they use the unsigned output score of a learned classifier, or its margin, as an indicator of confidence. The working principle of self-training algorithms is to learn a classifier iteratively by assigning pseudo-labels to the unlabeled training samples whose margin exceeds a certain threshold. The pseudo-labeled examples are then used, together with the labeled training set, to train a new classifier. In this paper, we present self-training methods for binary and multi-class classification, as well as their variants and two related approaches, namely consistency-based approaches and transductive learning. We examine the impact of significant self-training features on various methods, using different general and image classification benchmarks, and we discuss our ideas for future research in self-training. To the best of our knowledge, this is the first thorough and complete survey on this subject.
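The iterative pseudo-labeling loop described in the abstract can be sketched in a few lines. The following is a minimal, illustrative NumPy implementation, not the method of any particular paper: it uses a nearest-centroid classifier as the base learner (an arbitrary choice for self-containment) and takes the gap between the distances to the two closest class centroids as the margin-based confidence score; the function names and the `threshold` parameter are ours.

```python
import numpy as np

def nearest_centroid_fit(X, y):
    """Compute one centroid per class from labeled data."""
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def nearest_centroid_predict(X, classes, centroids):
    """Predict the nearest class; the margin is the distance gap
    between the second-nearest and nearest centroids."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    order = np.argsort(d, axis=1)
    labels = classes[order[:, 0]]
    margin = d[np.arange(len(X)), order[:, 1]] - d[np.arange(len(X)), order[:, 0]]
    return labels, margin

def self_train(X_lab, y_lab, X_unlab, threshold=0.5, max_iter=10):
    """Self-training: repeatedly pseudo-label unlabeled samples whose
    margin exceeds `threshold`, add them to the labeled set, and refit."""
    X_lab, y_lab, X_unlab = X_lab.copy(), y_lab.copy(), X_unlab.copy()
    for _ in range(max_iter):
        if len(X_unlab) == 0:
            break
        classes, centroids = nearest_centroid_fit(X_lab, y_lab)
        labels, margin = nearest_centroid_predict(X_unlab, classes, centroids)
        confident = margin >= threshold
        if not confident.any():
            break  # no sample is confident enough; stop
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, labels[confident]])
        X_unlab = X_unlab[~confident]
    # final fit on the enriched labeled set
    return nearest_centroid_fit(X_lab, y_lab)
```

In practice the base learner would be a neural network or any probabilistic classifier, and the confidence score its (unsigned) output or prediction margin; the structure of the loop — fit, pseudo-label above a threshold, enrich, refit — is what the sketch illustrates.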