Recently, contrastive learning has made significant progress in learning visual representations from unlabeled data. The core idea is to train the backbone to be invariant to different augmentations of an instance. While most methods only maximize the feature similarity between two augmented views of the same instance, we further generate more challenging training samples and force the model to keep predicting discriminative representations on these hard samples. In this paper, we propose MixSiam, a mixture-based approach built upon the traditional siamese network. On the one hand, we feed two augmented views of an instance to the backbone and obtain a discriminative representation by taking the element-wise maximum of the two features. On the other hand, we take the mixture of these augmented views as input and expect the model's prediction to be close to the discriminative representation. In this way, the model has access to more varied samples of an instance and keeps predicting an invariant discriminative representation for them. The learned model is thus more robust than those of previous contrastive learning methods. Extensive experiments on large-scale datasets show that MixSiam steadily improves the baseline and achieves results competitive with state-of-the-art methods. Our code will be released soon.
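The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the fixed mixing coefficient `lam`, and the SimSiam-style negative-cosine objective are assumptions for clarity.

```python
import numpy as np

def discriminative_target(f1, f2):
    # Element-wise maximum of the two view features, giving the
    # "discriminative representation" the abstract describes.
    return np.maximum(f1, f2)

def mix_views(x1, x2, lam=0.5):
    # Pixel-wise convex combination of the two augmented views
    # (mixup-style; the coefficient lam=0.5 is an assumption).
    return lam * x1 + (1 - lam) * x2

def negative_cosine(p, z):
    # Negative cosine similarity between the prediction p for the mixed
    # input and the target z (z would be gradient-stopped in training).
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

# Toy example: the max-pooled target keeps the strongest response
# from either view, and the loss is minimized (-1) when the mixed-input
# prediction matches that target exactly.
f1, f2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
target = discriminative_target(f1, f2)           # [1.0, 1.0]
loss = negative_cosine(target.copy(), target)    # -1.0 at perfect alignment
```

In an actual training loop, `f1` and `f2` would come from the backbone applied to the two augmented views, and the backbone's output on `mix_views(x1, x2)` would be pushed toward `target` via this loss.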