Recent state-of-the-art methods in semi-supervised learning (SSL) combine consistency regularization with confidence-based pseudo-labeling. To obtain high-quality pseudo-labels, a high confidence threshold is typically adopted. However, it has been shown that softmax-based confidence scores in deep networks can be arbitrarily high for samples far from the training data, and thus the pseudo-labels for even high-confidence unlabeled samples may still be unreliable. In this work, we present a new perspective on pseudo-labeling: instead of relying on model confidence, we measure whether an unlabeled sample is likely to be "in-distribution", i.e., close to the current training data. To classify whether an unlabeled sample is "in-distribution" or "out-of-distribution", we adopt the energy score from the out-of-distribution detection literature. As training progresses and more unlabeled samples become in-distribution and contribute to training, the combined labeled and pseudo-labeled data can better approximate the true distribution and improve the model. Experiments demonstrate that our energy-based pseudo-labeling method, albeit conceptually simple, significantly outperforms confidence-based methods on imbalanced SSL benchmarks and achieves competitive performance on class-balanced data. For example, it produces a 4-6% absolute accuracy improvement on CIFAR10-LT when the imbalance ratio is higher than 50. When combined with state-of-the-art long-tailed SSL methods, further improvements are attained.
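To make the in-distribution test concrete, below is a minimal PyTorch sketch of energy-based sample selection. The energy definition follows the standard out-of-distribution detection formulation, E(x) = -T · log Σ_j exp(f_j(x)/T), where lower energy indicates a more in-distribution sample; the function names, temperature, and threshold value are illustrative assumptions, not the paper's tuned settings.

```python
import torch

def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Energy score from the OOD-detection literature:
    E(x) = -T * log(sum_j exp(f_j(x) / T)).
    Lower energy indicates a more "in-distribution" sample."""
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)

def select_pseudo_labels(logits: torch.Tensor, energy_threshold: float = -8.0):
    """Pseudo-label only the samples whose energy falls below a threshold,
    i.e., those treated as in-distribution. The threshold value is an
    illustrative assumption."""
    energies = energy_score(logits)
    in_dist_mask = energies < energy_threshold   # True = keep for training
    pseudo_labels = logits.argmax(dim=-1)        # argmax class as pseudo-label
    return pseudo_labels, in_dist_mask

# Tiny synthetic check: a batch of 4 unlabeled samples, 10 classes.
logits = torch.randn(4, 10)
pseudo, mask = select_pseudo_labels(logits)
print(energy_score(logits))  # per-sample energies
print(pseudo[mask])          # pseudo-labels retained for the unsupervised loss
```

In a FixMatch-style pipeline, this mask would replace the usual confidence-threshold mask when weighting the unsupervised cross-entropy loss on strongly augmented views.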