Pseudo-label-based semi-supervised learning (SSL) has achieved great success in utilizing raw data. However, its training procedure suffers from confirmation bias due to the noise contained in self-generated artificial labels. Moreover, the model's judgment becomes noisier in real-world applications with extensive out-of-distribution data. To address this issue, we propose a general method named Class-aware Contrastive Semi-Supervised Learning (CCSSL), a drop-in helper that improves pseudo-label quality and enhances the model's robustness in real-world settings. Rather than treating real-world data as a single union set, our method separately handles reliable in-distribution data with class-wise clustering for blending into downstream tasks, and noisy out-of-distribution data with image-wise contrastive learning for better generalization. Furthermore, by applying target re-weighting, we emphasize clean-label learning and simultaneously reduce noisy-label learning. Despite its simplicity, our proposed CCSSL yields significant performance improvements over state-of-the-art SSL methods on the standard datasets CIFAR100 and STL10. On the real-world dataset Semi-iNat 2021, we improve FixMatch by 9.80% and CoMatch by 3.18%. Code is available at https://github.com/TencentYoutuResearch/Classification-SemiCLS.
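The class-aware treatment described above (class-wise positives for confident in-distribution samples, instance-wise positives for possibly out-of-distribution ones, with confidence-based target re-weighting) can be sketched as a positive-pair mask construction. This is an illustrative simplification, not the paper's exact formulation; the confidence threshold and the product-of-confidences re-weighting are assumptions.

```python
import numpy as np

def class_aware_mask(pseudo_probs, threshold=0.95):
    """Build a positive-pair mask for class-aware contrastive learning.

    Confident samples (max pseudo-label probability >= threshold) treat all
    other confident samples of the same pseudo-class as positives, weighted
    by the product of their confidences (target re-weighting). Low-confidence
    samples, likely out-of-distribution, keep only instance-wise positives
    (themselves / their own augmented views).
    """
    probs = np.asarray(pseudo_probs, dtype=float)
    conf = probs.max(axis=1)        # per-sample confidence
    labels = probs.argmax(axis=1)   # hard pseudo-labels
    confident = conf >= threshold

    n = len(labels)
    mask = np.eye(n)                # instance-wise positives by default
    for i in range(n):
        if confident[i]:
            same = (labels == labels[i]) & confident
            # class-wise positives, re-weighted by confidence product
            mask[i, same] = conf[i] * conf[same]
    return mask

# Example: two confident samples of class 0 and one low-confidence sample.
mask = class_aware_mask([[0.98, 0.02], [0.97, 0.03], [0.60, 0.40]])
```

In a full pipeline, this mask would weight the targets of a contrastive loss over a similarity matrix of embeddings, so clean pseudo-labels pull class clusters together while uncertain samples only contrast against their own views.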