The recent breakthrough achieved by contrastive learning accelerates the pace of deploying unsupervised training in real-world applications. However, unlabeled data in practice is commonly imbalanced and follows a long-tail distribution, and it is unclear how robustly the latest contrastive learning methods perform in such scenarios. This paper proposes to explicitly tackle this challenge via a principled framework called Self-Damaging Contrastive Learning (SDCLR), which automatically balances representation learning without knowing the classes. Our main inspiration is drawn from the recent finding that deep models have difficult-to-memorize samples, and that these samples can be exposed through network pruning. It is then natural to hypothesize that long-tail samples are also harder for the model to learn well due to insufficient examples. Hence, the key innovation in SDCLR is to create a dynamic self-competitor model to contrast with the target model, where the self-competitor is a pruned version of the target. During training, contrasting the two models leads to adaptive online mining of the samples most easily forgotten by the current target model, and implicitly emphasizes them more in the contrastive loss. Extensive experiments across multiple datasets and imbalance settings show that SDCLR significantly improves not only overall accuracy but also balancedness, under linear evaluation in both full-shot and few-shot settings. Our code is available at: https://github.com/VITA-Group/SDCLR.
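To make the core mechanism concrete, below is a minimal, self-contained sketch of the SDCLR idea, not the authors' official implementation: the self-competitor is obtained by masking the smallest-magnitude weights of the target encoder, and the dense and pruned branches are contrasted with a SimCLR-style NT-Xent loss. The toy encoder and helper names (`TinyEncoder`, `magnitude_prune_masks`, `sdclr_step`) are illustrative assumptions, not the paper's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call


class TinyEncoder(nn.Module):
    """Stand-in backbone + projection head (the paper uses ResNet encoders)."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.projector = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, x):
        return self.projector(self.backbone(x))


def magnitude_prune_masks(model: nn.Module, sparsity: float = 0.3) -> dict:
    """Binary masks that zero the smallest-magnitude entries of each weight tensor."""
    masks = {}
    with torch.no_grad():
        for name, p in model.named_parameters():
            if p.dim() > 1:  # prune conv/linear weights, keep biases untouched
                k = max(1, int(p.numel() * sparsity))
                threshold = p.abs().flatten().kthvalue(k).values
                masks[name] = (p.abs() > threshold).float()
    return masks


def nt_xent(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.2) -> torch.Tensor:
    """Standard NT-Xent contrastive loss between two batches of embeddings."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature
    n = z1.size(0)
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool, device=z.device), float("-inf"))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)


def sdclr_step(model, masks, view1, view2, optimizer) -> float:
    """One step: view1 goes through the dense target, view2 through its pruned self-competitor."""
    z1 = model(view1)
    # Pruned branch: same shared weights multiplied by the masks, so gradients from
    # this branch only reach the surviving (non-pruned) weights of the target model.
    pruned_params = {name: p * masks[name] for name, p in model.named_parameters()
                     if name in masks}
    z2 = functional_call(model, pruned_params, (view2,))
    loss = nt_xent(z1, z2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    model = TinyEncoder()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    masks = magnitude_prune_masks(model)  # in the full method the mask is refreshed as training proceeds
    view1 = torch.randn(8, 3, 32, 32)     # two augmented views of the same images
    view2 = torch.randn(8, 3, 32, 32)
    print(sdclr_step(model, masks, view1, view2, optimizer))
```

Because the self-competitor is derived from the current target weights, periodically recomputing the pruning mask keeps the "damage" dynamic: samples whose representations change most after pruning (typically the long-tail, easily forgotten ones) produce larger disagreement between the two branches and are therefore implicitly up-weighted by the contrastive loss. Details such as the pruning ratio, how often the mask is refreshed, and batch-normalization handling follow the official repository rather than this sketch.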