We consider iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data to progressively refine the model parameters. In particular, we seek to understand the behaviour of the {\em generalization error} of iterative SSL algorithms using information-theoretic principles. To obtain bounds that are amenable to numerical evaluation, we first work with a simple model -- namely, the binary Gaussian mixture model. Our theoretical results suggest that when the class-conditional variances are not too large, the upper bound on the generalization error decreases monotonically with the number of iterations, but quickly saturates. The theoretical results on the simple model are corroborated by extensive experiments on several benchmark datasets, such as MNIST and CIFAR, in which we observe that the generalization error improves over the first several pseudo-labelling iterations but saturates afterwards.
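To make the setup concrete, the following is a minimal sketch of iterative pseudo-labelling on the binary Gaussian mixture model, written under illustrative assumptions rather than as the paper's exact algorithm: samples are drawn as $X \sim \mathcal{N}(y\mu, \sigma^2 I)$ with $y \in \{-1, +1\}$, the class mean is first estimated from the labelled samples, and each iteration pseudo-labels the unlabelled pool with the current linear classifier and re-estimates the mean from all samples. The dimension, noise level, sample sizes, and the simple mean-estimator update are all assumptions made for the sketch.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# Binary Gaussian mixture: class y in {-1, +1}, X ~ N(y * mu, sigma^2 I).
# Dimension d, noise sigma, and sample sizes are illustrative choices.
d, sigma = 2, 0.8
mu = np.ones(d) / np.sqrt(d)  # true (unit-norm) class mean

n_lab, n_unlab = 20, 2000
y_lab = rng.choice([-1, 1], size=n_lab)
X_lab = y_lab[:, None] * mu + sigma * rng.standard_normal((n_lab, d))
y_unlab = rng.choice([-1, 1], size=n_unlab)  # hidden; used only to score
X_unlab = y_unlab[:, None] * mu + sigma * rng.standard_normal((n_unlab, d))

# Iteration 0: estimate the class mean from labelled data only.
theta = (y_lab[:, None] * X_lab).mean(axis=0)

for t in range(10):
    # Pseudo-label the unlabelled pool with the current linear classifier.
    pseudo = np.sign(X_unlab @ theta)
    # Refine the estimate using labelled + pseudo-labelled samples.
    theta = np.concatenate([y_lab[:, None] * X_lab,
                            pseudo[:, None] * X_unlab]).mean(axis=0)
    # Error is measured on the same unlabelled pool, purely for illustration.
    err = np.mean(np.sign(X_unlab @ theta) != y_unlab)
    print(f"iteration {t + 1}: error on unlabelled pool = {err:.3f}")
\end{verbatim}

Under an update of this kind, one would typically expect the error to drop over the first few iterations and then plateau, qualitatively consistent with the saturation behaviour described above.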