This paper presents Dense Siamese Network (DenseSiam), a simple unsupervised learning framework for dense prediction tasks. It learns visual representations by maximizing the similarity between two views of one image under two types of consistency, i.e., pixel consistency and region consistency. Concretely, DenseSiam first maximizes pixel-level spatial consistency according to the exact location correspondence in the overlapped area. It also extracts a batch of region embeddings that correspond to sub-regions of the overlapped area and contrasts them for region consistency. In contrast to previous methods that require negative pixel pairs, momentum encoders, or heuristic masks, DenseSiam benefits from the simple Siamese network and optimizes consistency at different granularities. It also shows that simple location correspondence and interacted region embeddings are sufficient to learn the similarity. We apply DenseSiam on ImageNet and obtain competitive improvements on various downstream tasks. We also show that, with only some extra task-specific losses, the simple framework can directly perform dense prediction tasks. On an existing unsupervised semantic segmentation benchmark, it surpasses state-of-the-art segmentation methods by 2.1 mIoU with only 28% of the training cost. Code and models are released at https://github.com/ZwwWayne/DenseSiam.
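The pixel-consistency objective described above can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's implementation: it assumes the two views' dense feature maps have already been cropped and warped into exact pixel-wise correspondence over the overlapped area, and it uses negative cosine similarity as the consistency measure.

```python
import numpy as np


def pixel_consistency_loss(feat_a, feat_b):
    """Negative mean cosine similarity between corresponding pixel
    embeddings of two views (hypothetical sketch; assumes the maps
    are already aligned over the overlapped area)."""
    # feat_a, feat_b: (C, H, W) dense feature maps from the two branches.
    # L2-normalize each pixel's C-dimensional embedding.
    a = feat_a / (np.linalg.norm(feat_a, axis=0, keepdims=True) + 1e-8)
    b = feat_b / (np.linalg.norm(feat_b, axis=0, keepdims=True) + 1e-8)
    # Per-pixel cosine similarity, averaged over all spatial locations.
    return -np.mean(np.sum(a * b, axis=0))


rng = np.random.default_rng(0)
f = rng.normal(size=(8, 4, 4))
# Identical views are perfectly consistent, so the loss reaches -1.0.
print(round(pixel_consistency_loss(f, f), 6))  # -1.0
```

In the full framework a stop-gradient on one branch (as in SimSiam-style training) would prevent collapse; this sketch only shows the similarity term itself.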