Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic ``expansion'' assumption, which states that a low-probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.
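As a rough illustration of the expansion assumption (the notation below is ours, introduced only for this sketch): writing $P$ for the data distribution and $\mathcal{N}(S)$ for the neighborhood of a subset $S$ of the input space, expansion with illustrative constants $a \in (0,1)$ and $c > 1$ can be sketched as
\[
P(S) \le a \quad\Longrightarrow\quad P\big(\mathcal{N}(S)\big) \ge \min\{\, c \cdot P(S),\; 1 \,\},
\]
i.e., any sufficiently small subset of the data has a neighborhood whose probability mass is larger than that of the subset by a multiplicative factor.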