We propose a novel framework for image clustering that incorporates joint representation learning and clustering. Our method consists of two heads that share the same backbone network: a "representation learning" head and a "clustering" head. The "representation learning" head captures fine-grained patterns of objects at the instance level, which serve as clues for the "clustering" head to extract coarse-grained information that separates objects into clusters. The whole model is trained in an end-to-end manner by minimizing the weighted sum of two sample-oriented contrastive losses applied to the outputs of the two heads. To ensure that the contrastive loss corresponding to the "clustering" head is optimal, we introduce a novel critic function called "log-of-dot-product". Extensive experimental results demonstrate that our method significantly outperforms state-of-the-art single-stage clustering methods across a variety of image datasets, improving over the best baseline by about 5-7% in accuracy on CIFAR10/20, STL10, and ImageNet-Dogs. Further, the "two-stage" variant of our method also achieves better results than baselines on three challenging ImageNet subsets.
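To make the two-head design concrete, below is a minimal PyTorch-style sketch of the architecture and training objective described above. The layer sizes, the weighting coefficient `lam`, and the exact form of the log-of-dot-product critic (taken here as log(p_i · p_j) between soft cluster assignments of two augmented views) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadClusteringModel(nn.Module):
    """Shared backbone with a representation-learning head and a clustering head."""
    def __init__(self, backbone, feat_dim=512, proj_dim=128, num_clusters=10):
        super().__init__()
        self.backbone = backbone                           # e.g. a ResNet with its classifier removed
        self.rep_head = nn.Linear(feat_dim, proj_dim)      # instance-level (fine-grained) features
        self.clu_head = nn.Linear(feat_dim, num_clusters)  # cluster-assignment logits

    def forward(self, x):
        h = self.backbone(x)
        z = F.normalize(self.rep_head(h), dim=1)   # unit-norm representation vector
        p = F.softmax(self.clu_head(h), dim=1)     # soft cluster-assignment probabilities
        return z, p

def contrastive_loss(a, b, critic="dot", temperature=0.1):
    """Sample-oriented contrastive (InfoNCE-style) loss between two augmented views.

    critic="dot": scaled dot-product critic, used here for the representation head.
    critic="log_dot": the assumed log-of-dot-product critic, applied to the
    probability vectors produced by the clustering head.
    """
    if critic == "log_dot":
        sim = torch.log(a @ b.t() + 1e-8)          # log of dot product between probability vectors
    else:
        sim = (a @ b.t()) / temperature
    targets = torch.arange(a.size(0), device=a.device)  # positive pairs lie on the diagonal
    return F.cross_entropy(sim, targets)

def total_loss(model, view1, view2, lam=1.0):
    """End-to-end objective: weighted sum of the two contrastive losses (weight `lam` assumed)."""
    z1, p1 = model(view1)
    z2, p2 = model(view2)
    return contrastive_loss(z1, z2, critic="dot") + lam * contrastive_loss(p1, p2, critic="log_dot")
```

In this sketch both losses are computed over the same mini-batch of paired augmentations, so the instance-level and cluster-level objectives are optimized jointly through the shared backbone.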