We currently lack an understanding of semi-supervised learning (SSL) objectives, such as pseudo-labelling and entropy minimization, as log-likelihoods, which precludes the development of, for example, Bayesian SSL. Here, we note that benchmark image datasets such as CIFAR-10 are carefully curated, and we formulate SSL objectives as a log-likelihood in a generative model of data curation that was initially developed to explain the cold-posterior effect (Aitchison 2020). SSL methods, from entropy minimization and pseudo-labelling to state-of-the-art techniques similar to FixMatch, can be understood as lower bounds on our principled log-likelihood. We are thus able to give a proof of principle for Bayesian SSL on toy data. Finally, our theory suggests that SSL is effective in part due to the statistical patterns induced by data curation. This explains past results showing that SSL performs better on clean datasets without any "out of distribution" examples. Confirming these results, we find that SSL gives much larger performance improvements on curated than on uncurated data, using matched curated and uncurated datasets based on Galaxy Zoo 2.
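To make the lower-bound claim concrete, here is a minimal numerical sketch. The abstract does not state the form of the curation likelihood, so everything specific here is an assumption for illustration: we suppose an unlabelled point is kept only if `S` hypothetical annotators, each sampling a label independently from the predictive distribution `p`, all agree, which gives a log-likelihood of `log sum_y p_y^S`. The function names and the parameter `S` are likewise hypothetical. Under that assumption, an entropy-minimization-style term and a pseudo-labelling-style term both lower-bound the log-likelihood.

```python
import numpy as np

def log_consensus(p, S=2):
    """Assumed curation log-likelihood: log sum_y p_y^S,
    the log-probability that S hypothetical annotators agree."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** S, axis=-1))

def neg_entropy_bound(p, S=2):
    """Entropy-minimization-style bound: (S - 1) * sum_y p_y log p_y.
    Follows from Jensen's inequality applied to log E_p[p_y^(S-1)]."""
    p = np.asarray(p, dtype=float)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0, so zero-probability classes contribute 0
    return (S - 1) * np.sum(p * np.log(safe), axis=-1)

def pseudo_label_bound(p, S=2):
    """Pseudo-labelling-style bound: S * log max_y p_y.
    Obtained by keeping only the most likely class in the sum."""
    p = np.asarray(p, dtype=float)
    return S * np.log(np.max(p, axis=-1))

if __name__ == "__main__":
    p = np.array([0.7, 0.2, 0.1])  # example predictive distribution over 3 classes
    print("log-likelihood (assumed form):", log_consensus(p))
    print("entropy-style lower bound:    ", neg_entropy_bound(p))
    print("pseudo-label lower bound:     ", pseudo_label_bound(p))
```

Both bounds become tight when `p` is one-hot, which is consistent with the intuition that these objectives reward confident predictions on unlabelled points.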