Siamese Image Modeling for Self-Supervised Vision Representation Learning

Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two mainstream SSL frameworks have been proposed: Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together the representations of different views of the same image while avoiding feature collapse. It does well on linear probing but is inferior in detection performance. MIM, on the other hand, reconstructs the original content given a masked image. It excels at dense prediction but does not perform well on linear probing. The gap arises because each framework neglects one of two representation requirements: semantic alignment or spatial sensitivity. Specifically, we observe that (1) semantic alignment demands that semantically similar views be projected into nearby representations, which can be achieved by contrasting different views with strong augmentations; (2) spatial sensitivity requires modeling the local structure within an image. Predicting dense representations from a masked image is therefore beneficial, because it models the conditional distribution of image content. Driven by this analysis, we propose Siamese Image Modeling (SIM), which predicts the dense representations of an augmented view based on another masked view of the same image with different augmentations. Our method uses a Siamese network with two branches. The online branch encodes the first view and predicts the second view's representation according to the relative positions between the two views. The target branch produces the target by encoding the second view. In this way, SIM achieves linear probing performance comparable to ID and dense prediction performance comparable to MIM. We also demonstrate that a decent linear probing result can be obtained without a global loss. Code will be released.
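The abstract says the online branch sees a masked first view and predicts the second view's dense representation from the relative positions between the two views. A minimal sketch of that geometric bookkeeping follows. It is not the authors' implementation: the `Crop` type, the patch-grid layout, the normalization to the first view's frame, and the masking ratio are all assumptions made for illustration, and the encoders and predictor are omitted entirely.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Crop:
    """Axis-aligned crop in source-image pixel coordinates (hypothetical helper)."""
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def patch_centers(crop, grid):
    """Centers of a grid x grid patch layout, in source-image coordinates."""
    step_x, step_y = crop.w / grid, crop.h / grid
    return [(crop.x + (j + 0.5) * step_x, crop.y + (i + 0.5) * step_y)
            for i in range(grid) for j in range(grid)]


def relative_positions(view_a, view_b, grid):
    """Express view_b's patch centers in view_a's normalized frame.

    (0, 0) is view_a's top-left corner and (1, 1) its bottom-right corner,
    so values outside [0, 1] mean the target patch lies outside view_a.
    This normalization is an assumption, not taken from the paper.
    """
    return [((cx - view_a.x) / view_a.w, (cy - view_a.y) / view_a.h)
            for cx, cy in patch_centers(view_b, grid)]


def random_mask(num_patches, mask_ratio, rng):
    """Sorted indices of the patches left visible to the online branch."""
    num_visible = num_patches - int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_visible))


# Two overlapping crops of the same image, each split into a 2 x 2 patch grid.
view_a = Crop(x=0, y=0, w=100, h=100)    # first view, fed to the online branch
view_b = Crop(x=50, y=50, w=100, h=100)  # second view, fed to the target branch
queries = relative_positions(view_a, view_b, grid=2)
visible = random_mask(num_patches=4, mask_ratio=0.75, rng=random.Random(0))
```

In a full pipeline, the online encoder would presumably process only the `visible` patches of `view_a`, a predictor would be queried at each entry of `queries`, and its outputs would be compared against the target branch's per-patch features for `view_b`. How the paper encodes these positions and which loss it uses is not stated in the abstract.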
