Semi-supervised learning approaches train on small sets of labeled data together with large sets of unlabeled data. Self-training is a semi-supervised teacher-student approach that often suffers from "confirmation bias": the student model repeatedly overfits to incorrect pseudo-labels produced by the teacher model for the unlabeled data. This bias impedes improvements in pseudo-label accuracy across self-training iterations, causing model performance to saturate after only a few iterations. In this work, we describe multiple enhancements to the self-training pipeline that mitigate the effect of confirmation bias. We evaluate these enhancements on multiple datasets, showing performance gains over existing self-training design choices. Finally, we also study the extensibility of our enhanced approach to Open Set unlabeled data (containing classes not seen in the labeled data).
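To make the baseline concrete, the following is a minimal sketch of the vanilla self-training loop described above, not the paper's actual pipeline. The scikit-learn models, toy dataset, confidence threshold, and iteration count are all illustrative assumptions chosen for brevity.

```python
# Minimal self-training sketch: a teacher pseudo-labels unlabeled data,
# and a student retrains on the labeled set plus confident pseudo-labels.
# All hyperparameters below are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy setup: a small labeled set and a large unlabeled set.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_lab, X_unl, y_lab, _ = train_test_split(X, y, train_size=100, random_state=0)

X_train, y_train = X_lab, y_lab
for it in range(5):  # self-training iterations
    teacher = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    probs = teacher.predict_proba(X_unl)
    conf = probs.max(axis=1)
    keep = conf >= 0.9           # confidence threshold (assumed)
    pseudo = probs.argmax(axis=1)
    # The student refits on labeled data plus confident pseudo-labels.
    # Incorrect pseudo-labels that pass the threshold get re-fit every
    # round -- this feedback loop is the source of confirmation bias.
    X_train = np.vstack([X_lab, X_unl[keep]])
    y_train = np.concatenate([y_lab, pseudo[keep]])
    print(f"iter {it}: kept {keep.sum()} pseudo-labels")
```

In this baseline, nothing corrects a wrong pseudo-label once it clears the threshold, which is why pseudo-label accuracy stalls across iterations; the enhancements the abstract refers to target exactly this failure mode.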