Self-training (ST) is a simple and standard approach in semi-supervised learning that has been applied to many machine learning problems. Despite its widespread acceptance and practical effectiveness, it is still not well understood why and how ST improves performance by fitting the model to potentially erroneous pseudo-labels. To investigate the properties of ST, in this study, we derive and analyze a sharp characterization of the behavior of iterative ST when training a linear classifier by minimizing the ridge-regularized convex loss for binary Gaussian mixtures, in the asymptotic limit where input dimension and data size diverge proportionally. The derivation is based on the replica method of statistical mechanics. The result indicates that, when the total number of iterations is large, ST may find a classification plane with the optimal direction regardless of the label imbalance by accumulating small parameter updates over long iterations. It is argued that this is because the small update of ST can accumulate information of the data in an almost noiseless way. However, when a label imbalance is present in true labels, the performance of the ST is significantly lower than that of supervised learning with true labels, because the ratio between the norm of the weight and the magnitude of the bias can become significantly large. To overcome the problems in label imbalanced cases, several heuristics are introduced. By numerically analyzing the asymptotic formula, it is demonstrated that with the proposed heuristics, ST can find a classifier whose performance is nearly compatible with supervised learning using true labels even in the presence of significant label imbalance.
翻译:暂无翻译