Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling

Contrastive learning approaches have achieved great success in learning visual representations with few labels of the target classes. That implies a tantalizing possibility of scaling them up beyond a curated "seed" benchmark by incorporating more unlabeled images from internet-scale external sources to enhance their performance. However, in practice, a larger amount of unlabeled data requires more computing resources because of the bigger model size and longer training needed. Moreover, open-world unlabeled data usually follows an implicit long-tail class or attribute distribution, and many of its samples do not belong to the target classes. Blindly leveraging all unlabeled data can hence lead to data imbalance as well as distraction issues. This motivates us to seek a principled approach to strategically select unlabeled data from an external source, in order to learn generalizable, balanced and diverse representations for relevant classes. In this work, we present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK), which follows three simple principles: (1) tailness, which encourages sampling of examples from tail classes by sorting the empirical contrastive loss expectation (ECLE) of samples over random data augmentations; (2) proximity, which rejects the out-of-distribution outliers that may distract training; and (3) diversity, which ensures diversity in the set of sampled examples. Empirically, using ImageNet-100-LT (without labels) as the seed dataset and two "noisy" external data sources, we demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features, as evaluated via linear classifier evaluation in full-shot and few-shot settings. The code is available at: https://github.com/VITA-Group/MAK.
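
The three principles can be illustrated with a small NumPy sketch. This is a simplified illustration written for this summary, not the authors' implementation (see the repository for that). The function names, the two keep-fraction thresholds, and the filter-then-select ordering are all assumptions. It presumes that feature vectors and per-augmentation contrastive losses have already been computed by some encoder, and it uses plain greedy k-center selection for the diversity step.

```python
import numpy as np


def empirical_contrastive_loss_expectation(per_aug_losses):
    """Average each sample's contrastive loss over its random augmentations.

    per_aug_losses has shape (n_samples, n_augmentations); how the losses
    are produced is outside this sketch.
    """
    return np.asarray(per_aug_losses, dtype=float).mean(axis=1)


def pairwise_dist(a, b):
    """Euclidean distance between every row of a and every row of b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def select_samples(seed_feats, pool_feats, pool_ecle, budget,
                   proximity_keep=0.5, tail_keep=0.5):
    """Pick `budget` pool indices (hypothetical helper, assumed thresholds)."""
    seed_feats = np.asarray(seed_feats, dtype=float)
    pool_feats = np.asarray(pool_feats, dtype=float)
    pool_ecle = np.asarray(pool_ecle, dtype=float)

    # Proximity: keep the pool samples closest to the seed set, which
    # discards likely out-of-distribution outliers.
    d_seed = pairwise_dist(pool_feats, seed_feats).min(axis=1)
    n_close = max(budget, int(len(pool_feats) * proximity_keep))
    close = np.argsort(d_seed)[:n_close]

    # Tailness: among those, keep the samples with the highest ECLE,
    # used here as a proxy for tail-class membership.
    n_tail = max(budget, int(len(close) * tail_keep))
    tail = close[np.argsort(-pool_ecle[close])[:n_tail]]

    # Diversity: greedy k-center, treating the seed features as centers
    # that are already chosen.
    cand = pool_feats[tail]
    min_dist = pairwise_dist(cand, seed_feats).min(axis=1)
    chosen = []
    for _ in range(min(budget, len(tail))):
        j = int(np.argmax(min_dist))          # farthest from all centers
        chosen.append(int(tail[j]))
        d_new = np.linalg.norm(cand - cand[j], axis=1)
        min_dist = np.minimum(min_dist, d_new)
        min_dist[j] = -np.inf                 # never pick the same sample twice
    return chosen
```

Calling `select_samples(seed_feats, pool_feats, ecle, budget=10)` returns ten distinct indices into the external pool. Applying the proximity and tailness filters before the k-center step keeps the example short; other ways of combining the three criteria are equally plausible.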
