Semi-supervised learning (SSL) is a powerful technique for leveraging unlabeled data to improve machine learning models, but it can be biased by ``informative'' labels, which occur when some classes are more likely to be labeled than others. In the missing data literature, such labels are said to be missing not at random (MNAR). In this paper, we propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm, including those using data augmentation. We also propose a likelihood ratio test to assess whether labels are indeed informative. Finally, we demonstrate the performance of the proposed methods on several datasets, in particular on two medical datasets for which we design pseudo-realistic missing-data scenarios.
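
To illustrate the idea of inverse propensity weighting under informative labels, the following is a minimal sketch and not the estimator proposed in the paper. It assumes that the probability of a label being observed depends only on the class, and that these per-class labeling probabilities (the hypothetical array `phi`) are known, whereas in practice they would have to be estimated. The function names, the simulated data, and all numerical values are illustrative assumptions.

```python
import numpy as np

def ipw_weights(y_labeled, phi):
    """Return the weight 1 / phi[y] for each labeled example.

    phi[k] is the (assumed known) probability that an example
    of class k gets labeled.
    """
    phi = np.asarray(phi, dtype=float)
    return 1.0 / phi[np.asarray(y_labeled)]

def ipw_class_proportions(y_labeled, phi, n_total):
    """Estimate class proportions of the full dataset from labeled data only."""
    weights = ipw_weights(y_labeled, phi)
    weighted_counts = np.bincount(y_labeled, weights=weights, minlength=len(phi))
    return weighted_counts / n_total

# Simulated data (illustrative values, not taken from the paper).
rng = np.random.default_rng(0)
n = 200_000
true_pi = np.array([0.5, 0.3, 0.2])   # true class proportions
phi = np.array([0.05, 0.20, 0.50])    # class-dependent labeling probabilities

y = rng.choice(3, size=n, p=true_pi)  # true classes, mostly unobserved
r = rng.random(n) < phi[y]            # r[i] is True when the label of i is observed
y_lab = y[r]

# Naive estimate: class frequencies among labeled examples (biased).
naive = np.bincount(y_lab, minlength=3) / y_lab.size
# Weighted estimate: each labeled example counts 1 / phi[y] times.
debiased = ipw_class_proportions(y_lab, phi, n)

print("naive:   ", naive.round(3))
print("debiased:", debiased.round(3))
```

The same weights could multiply the per-example labeled loss of an SSL objective, so that rarely labeled classes are not under-represented in the supervised term. How well this works in practice depends on how accurately the labeling probabilities are estimated.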