Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed: Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together the representations of different views of the same image while avoiding feature collapse. It performs well on linear probing but is inferior on detection. MIM, on the other hand, reconstructs the original content from a masked image. It excels at dense prediction but performs poorly on linear probing. These differences arise because each framework neglects one of two representation requirements: semantic alignment or spatial sensitivity. Specifically, we observe that (1) semantic alignment demands that semantically similar views be projected into nearby representations, which can be achieved by contrasting different views under strong augmentations; and (2) spatial sensitivity requires modeling the local structure within an image. Predicting dense representations from a masked image is therefore beneficial because it models the conditional distribution of image content. Driven by this analysis, we propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view based on another masked view of the same image with different augmentations. Our method uses a Siamese network with two branches. The online branch encodes the first view and predicts the second view's representation according to the relative positions between the two views. The target branch produces the target by encoding the second view. In this way, we achieve linear probing performance comparable to ID and dense prediction performance comparable to MIM. We also demonstrate that decent linear probing results can be obtained without a global loss. Code will be released at https://github.com/fundamentalvision/Siamese-Image-Modeling.
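
The geometric bookkeeping behind predicting one view from another can be illustrated with a minimal sketch. It is not the released implementation. The crop-box format `(x, y, w, h)`, the uniform random masking, the mean-squared-error loss, and every function name below are assumptions made for illustration; the abstract does not specify any of them. The sketch covers three pieces: which patches of the first view stay visible, where the second view's patches lie in the first view's coordinate frame (the relative positions the online branch would be conditioned on), and a dense loss between predicted and target patch features.

```python
import random


def random_mask(num_patches, mask_ratio, seed=0):
    """Return sorted indices of patches that stay visible.

    Uniform random masking is an assumption, not taken from the abstract.
    """
    rng = random.Random(seed)
    num_keep = num_patches - int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_keep))


def relative_patch_positions(box_a, box_b, grid):
    """Centers of view b's patches expressed in view a's normalized frame.

    Boxes are hypothetical (x, y, w, h) crop rectangles in source-image
    pixels. (0, 0) is view a's top-left corner and (1, 1) its bottom-right
    corner, so coordinates outside [0, 1] lie outside view a.
    """
    xa, ya, wa, ha = box_a
    xb, yb, wb, hb = box_b
    positions = []
    for row in range(grid):
        for col in range(grid):
            # Patch center in source-image pixels.
            cx = xb + (col + 0.5) * wb / grid
            cy = yb + (row + 0.5) * hb / grid
            # Re-express the center relative to view a.
            positions.append(((cx - xa) / wa, (cy - ya) / ha))
    return positions


def dense_loss(pred, target):
    """Mean squared error over all patches and feature dimensions.

    The choice of loss is an assumption; pred and target are lists of
    per-patch feature vectors of equal shape.
    """
    total = sum(
        (p - t) ** 2
        for pred_vec, target_vec in zip(pred, target)
        for p, t in zip(pred_vec, target_vec)
    )
    return total / (len(pred) * len(pred[0]))


# Two overlapping crops of the same image, each split into a 2x2 patch grid.
box_a = (0, 0, 100, 100)
box_b = (50, 50, 100, 100)
visible = random_mask(num_patches=4, mask_ratio=0.5)
positions = relative_patch_positions(box_a, box_b, grid=2)
```

In a full model, the online branch would encode only the `visible` patches of the first view and be queried at `positions` to predict the second view's patch features, while the target branch would supply the `target` features for `dense_loss`.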