Semi-supervised learning (SSL) leverages both labeled and unlabeled data to train models when labeled data is limited and unlabeled data is abundant. Since unlabeled data is typically more widely available than labeled data, it is used to improve a model's generalization when labeled data is scarce. However, in real-world settings, the unlabeled data may follow a different distribution than the labeled dataset. This is known as distribution mismatch. Such a problem generally arises when the unlabeled data comes from a different source than the labeled data. For instance, in the medical imaging domain, when training a COVID-19 detector using chest X-ray images, unlabeled datasets sampled from different hospitals might be used. In this work, we propose an automatic thresholding method to filter out-of-distribution data from the unlabeled dataset. We score each unlabeled observation using the Mahalanobis distance between the labeled and unlabeled datasets, computed in the feature space of a pre-trained ImageNet Feature Extractor (FE). We test two simple automatic thresholding methods in the context of training a COVID-19 detector using chest X-ray images. The tested methods provide an automatic way to determine which unlabeled data to preserve when training a semi-supervised deep learning architecture.
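To make the scoring and filtering step concrete, the sketch below illustrates the general idea under stated assumptions: a ResNet-18 ImageNet backbone stands in for the pre-trained feature extractor, and a simple keep-fraction (percentile) rule stands in for the automatic thresholding methods evaluated in the paper. Neither choice is taken from the original work; this is an illustrative sketch, not the authors' implementation.

```python
# Minimal sketch (not the authors' exact implementation) of scoring unlabeled
# images with the Mahalanobis distance in the feature space of a pre-trained
# ImageNet backbone, then filtering with a simple automatic threshold.
# The ResNet-18 backbone and the percentile rule are illustrative assumptions.
import numpy as np
import torch
import torchvision.models as models

# Pre-trained ImageNet feature extractor: drop the classification head.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()


@torch.no_grad()
def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (N, 3, 224, 224) tensor, already normalized for ImageNet."""
    return backbone(images).cpu().numpy()


def mahalanobis_scores(labeled_feats: np.ndarray,
                       unlabeled_feats: np.ndarray) -> np.ndarray:
    """Distance of each unlabeled feature vector to the labeled feature distribution."""
    mu = labeled_feats.mean(axis=0)
    cov = np.cov(labeled_feats, rowvar=False)
    # Regularize in case the covariance is ill-conditioned (few labeled samples).
    cov += 1e-6 * np.eye(cov.shape[0])
    cov_inv = np.linalg.inv(cov)
    diff = unlabeled_feats - mu
    # Quadratic form diff @ cov_inv @ diff^T, evaluated row by row.
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))


def filter_by_percentile(scores: np.ndarray, keep_fraction: float = 0.8) -> np.ndarray:
    """One possible automatic threshold: keep the closest keep_fraction of samples."""
    threshold = np.quantile(scores, keep_fraction)
    return scores <= threshold  # boolean mask over the unlabeled set
```

In this sketch, unlabeled observations whose feature-space distance to the labeled distribution exceeds the automatically chosen threshold are discarded before SSL training; the remaining samples are treated as in-distribution.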