Pseudo-labels are confident predictions made on unlabeled target data by a classifier trained on labeled source data. They are widely used for adapting a model to unlabeled data, e.g., in a semi-supervised learning setting. Our key insight is that pseudo-labels are naturally imbalanced due to intrinsic data similarity, even when a model is trained on balanced source data and evaluated on balanced target data. If we address this previously unknown imbalanced classification problem arising from pseudo-labels instead of ground-truth training labels, we could remove model biases towards false majorities created by pseudo-labels. We propose a novel and effective debiased learning method with pseudo-labels, based on counterfactual reasoning and adaptive margins: The former removes the classifier response bias, whereas the latter adjusts the margin of each class according to the imbalance of pseudo-labels. Validated by extensive experimentation, our simple debiased learning delivers significant accuracy gains over the state-of-the-art on ImageNet-1K: 26% for semi-supervised learning with 0.2% annotations and 9% for zero-shot learning. Our code is available at: https://github.com/frank-xwang/debiased-pseudo-labeling.
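The debiasing idea above can be sketched in a few lines: before taking the argmax on unlabeled data, subtract a scaled log of the model's running average prediction from the logits, so classes the classifier is biased toward need stronger evidence to win. This is a minimal illustrative sketch, not the authors' implementation; the function name, the hyperparameters `tau` and `threshold`, and the exact margin form are assumptions for exposition.

```python
import math

def debiased_pseudo_label(logits, avg_probs, tau=1.0, threshold=0.95):
    """Return (pseudo_label, keep_flag) for one unlabeled example.

    logits    -- raw classifier scores for this example, one per class
    avg_probs -- running mean of the model's predicted probabilities
                 over unlabeled data (a proxy for its response bias)
    tau       -- debiasing strength (illustrative hyperparameter)
    threshold -- confidence cutoff for keeping the pseudo-label
    """
    # Counterfactual-style margin: over-predicted classes (large
    # avg_probs) get a larger subtractive margin, countering the
    # false-majority bias created by imbalanced pseudo-labels.
    debiased = [z - tau * math.log(p + 1e-12)
                for z, p in zip(logits, avg_probs)]

    # Softmax over the debiased logits (numerically stabilized).
    m = max(debiased)
    exps = [math.exp(z - m) for z in debiased]
    total = sum(exps)
    probs = [e / total for e in exps]

    confidence = max(probs)
    label = probs.index(confidence)
    return label, confidence >= threshold
```

For example, with raw logits `[2.0, 1.9]` a naive argmax picks class 0; but if the running average shows the model predicts class 0 ninety percent of the time (`avg_probs = [0.9, 0.1]`), the debiased margin flips the pseudo-label to the under-predicted class 1.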