Unsupervised disentangled representation learning is a long-standing problem in computer vision. This work proposes a novel framework for image clustering from deep embeddings that combines instance-level contrastive learning with a deep-embedding-based cluster-center predictor. Our approach jointly learns representations and predicts cluster centers in an end-to-end manner, via a three-pronged objective that combines a clustering loss, an instance-wise contrastive loss, and an anchor loss. Our core intuition is that an ensemble loss coupling instance-level features with a clustering procedure focused on semantic similarity reinforces the learning of better representations in the latent space. Our method performs exceptionally well on popular vision datasets under standard clustering metrics such as Normalized Mutual Information (NMI), while also producing geometrically well-separated cluster embeddings under the Euclidean distance. The framework performs on par with widely accepted clustering methods and outperforms the state-of-the-art contrastive learning method on the CIFAR-10 dataset with an NMI score of 0.772, a 7-8% improvement over the strong baseline.
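The abstract does not give the loss formulas; as one illustrative piece, here is a minimal NumPy sketch of an instance-wise contrastive loss of the NT-Xent form commonly used in contrastive learning. This is an assumption about the shape of that component, not the paper's actual implementation: the function name `nt_xent_loss`, the temperature value, and the two-view batch layout are all hypothetical.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent-style instance-wise contrastive loss over two augmented views.

    z1, z2: (N, D) embedding arrays; row i of z1 and row i of z2 form a
    positive pair, and every other row in the joint batch is a negative.
    (Hypothetical sketch -- not the paper's exact formulation.)
    """
    # L2-normalize so dot products become cosine similarities
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)           # (2N, D) joint batch
    sim = z @ z.T / temperature                    # scaled pairwise similarities
    np.fill_diagonal(sim, -np.inf)                 # exclude self-similarity
    n = z1.shape[0]
    # positive index of sample i is i+n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    # cross-entropy: -log softmax(sim)[i, pos[i]], averaged over the batch
    logsumexp = np.log(np.exp(sim).sum(axis=1))
    return (logsumexp - sim[np.arange(2 * n), pos]).mean()
```

In the paper's framework this term would be combined with the clustering and anchor losses into a single training objective; only the contrastive component is sketched here.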