Using information-theoretic principles, we consider the generalization error (gen-error) of iterative semi-supervised learning (SSL) algorithms that iteratively generate pseudo-labels for a large amount of unlabelled data to progressively refine the model parameters. In contrast to most previous works that {\em bound} the gen-error, we provide an {\em exact} expression for the gen-error and particularize it to the binary Gaussian mixture model. Our theoretical results suggest that when the class conditional variances are not too large, the gen-error decreases with the number of iterations, but quickly saturates. Conversely, if the class conditional variances (and hence the amount of overlap between the classes) are large, the gen-error increases with the number of iterations. To mitigate this undesirable effect, we show that regularization can reduce the gen-error. The theoretical results are corroborated by extensive experiments on the MNIST and CIFAR datasets, in which we observe that for easy-to-distinguish classes, the gen-error improves over the first several pseudo-labelling iterations but saturates thereafter, while for more difficult-to-distinguish classes, regularization improves the generalization performance.
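To make the iterative procedure concrete, below is a minimal sketch of iterative pseudo-labelling on a binary Gaussian mixture with class means $\pm\mu$ and a simple mean-based linear classifier. The sample sizes, the shrinkage form of regularization (controlled by \texttt{lam}), and all variable names are illustrative assumptions for exposition, not the paper's exact algorithm.

\begin{verbatim}
# Minimal sketch: iterative pseudo-labelling (self-training) on a
# binary Gaussian mixture. All parameters below are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 2, 1.0                      # dimension; class-conditional std
mu = np.ones(d) / np.sqrt(d)           # class means at +mu and -mu

def sample(n):
    """Draw n points from the two-component Gaussian mixture."""
    y = rng.choice([-1, 1], size=n)
    x = y[:, None] * mu + sigma * rng.standard_normal((n, d))
    return x, y

x_l, y_l = sample(20)                  # small labelled set
x_u, _ = sample(2000)                  # large unlabelled set
x_te, y_te = sample(5000)              # test set (gen-error proxy)

lam = 0.1                              # regularization strength (assumed)
w = (y_l[:, None] * x_l).mean(axis=0)  # initial estimate from labels

for t in range(10):                    # pseudo-labelling iterations
    y_pseudo = np.sign(x_u @ w)        # pseudo-label the unlabelled data
    # Refit the mean direction on pseudo-labelled data, with shrinkage
    # toward zero playing the role of regularization.
    w = (y_pseudo[:, None] * x_u).mean(axis=0) / (1 + lam)
    err = np.mean(np.sign(x_te @ w) != y_te)
    print(f"iteration {t}: test error = {err:.3f}")
\end{verbatim}

In this toy setting, increasing \texttt{sigma} enlarges the class overlap, and increasing \texttt{lam} shrinks the refitted estimate, which mimics the regularization effect discussed above.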