A deep clustering model conceptually consists of a feature extractor that maps data points to a latent space and a clustering head that groups the data points into clusters in that space. Although the two components were traditionally trained jointly in an end-to-end fashion, recent works have shown it beneficial to train them separately in two stages. In the first stage, the feature extractor is trained via self-supervised learning, which preserves the cluster structure among the data points. To preserve this structure even better, we propose replacing the first stage with a model pretrained on a much larger dataset via self-supervised learning. The approach is simple and might suffer from domain shift. Nonetheless, we show empirically that it achieves superior clustering performance. With a vision transformer (ViT) architecture for feature extraction, our method achieves clustering accuracies of 94.0%, 55.6%, and 97.9% on CIFAR-10, CIFAR-100, and STL-10, respectively, compared with previous state-of-the-art results of 84.3%, 47.7%, and 80.8%. Our code will be made available online upon publication of the paper.