We present a general methodology for learning to classify images without labels by leveraging pretrained feature extractors. Our approach trains clustering heads via self-distillation, motivated by the observation that nearest neighbors in the pretrained feature space are likely to share the same label. We propose a novel objective that learns associations between images by introducing a variant of pointwise mutual information together with instance weighting. We demonstrate that the proposed objective attenuates the effect of false positive pairs while efficiently exploiting the structure of the pretrained feature space. As a result, across $17$ different pretrained models, we improve clustering accuracy over $k$-means by $6.1$\% on ImageNet and $12.2$\% on CIFAR100. Finally, using self-supervised pretrained vision transformers, we push the clustering accuracy on ImageNet to $61.6$\%. The code will be open-sourced.
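The pipeline the abstract describes — mine nearest neighbors in a pretrained feature space, then score pairs of soft cluster assignments with a pointwise-mutual-information-style objective — can be illustrated with a toy numpy sketch. Everything here is an illustrative assumption: the synthetic blobs stand in for real pretrained features, the cosine-similarity head stands in for a trained clustering head, and `pair_pmi` is a simplified PMI-style pair score, not the paper's exact objective (which additionally uses instance weighting and self-distillation).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained feature space: three separated Gaussian
# blobs (hypothetical data; the paper uses real pretrained extractors).
n_per, d, k = 30, 16, 3
centers = rng.normal(size=(k, d)) * 5.0
labels = np.repeat(np.arange(k), n_per)
feats = centers[labels] + rng.normal(size=(k * n_per, d))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # work in cosine space

# Step 1: mine each image's nearest neighbor in the feature space.
sim = feats @ feats.T
np.fill_diagonal(sim, -np.inf)
nn = sim.argmax(axis=1)

# The assumption the method leverages: neighbors usually share a label.
agreement = (labels == labels[nn]).mean()

# Step 2: soft cluster assignments from a mock clustering head
# (here: temperature-scaled cosine similarity to cluster directions).
def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

head = centers / np.linalg.norm(centers, axis=1, keepdims=True)
p = softmax(10.0 * feats @ head.T)      # (N, k) soft assignments
marg = p.mean(axis=0)                   # cluster marginals

# Step 3: a PMI-style pair score, log(sum_c p_c * q_c / m_c):
# positive when the pair agrees on a cluster, negative otherwise.
def pair_pmi(p_i, q_j, marginal, eps=1e-8):
    return np.log(np.sum(p_i * q_j / (marginal + eps)) + eps)

nn_pmi = np.mean([pair_pmi(p[i], p[nn[i]], marg) for i in range(len(p))])
rand = rng.permutation(len(p))
rand_pmi = np.mean([pair_pmi(p[i], p[j], marg) for i, j in enumerate(rand)])
print(f"neighbor agreement {agreement:.2f}, "
      f"PMI: neighbors {nn_pmi:.2f} vs random {rand_pmi:.2f}")
```

On such well-separated toy data, neighbor pairs agree on labels almost always, and the PMI-style score for neighbor pairs clearly exceeds that of random pairs, which is what makes the score usable as a training signal for a clustering head.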