Establishing visual correspondence across images is a challenging and essential task. Recently, many self-supervised methods have been proposed to learn better representations for visual correspondence. However, we find that these methods often fail to leverage semantic information and over-rely on matching low-level features. In contrast, human vision is capable of distinguishing between distinct objects as a pretext for tracking. Inspired by this paradigm, we propose to learn semantic-aware fine-grained correspondence. First, we demonstrate that semantic correspondence is implicitly available through a rich set of image-level self-supervised methods. We further design a pixel-level self-supervised learning objective that specifically targets fine-grained correspondence. For downstream tasks, we fuse these two complementary kinds of correspondence representations, demonstrating that they boost performance synergistically. Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks, including video object segmentation, human pose tracking, and human part tracking.
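The fusion step described above can be illustrated with a minimal sketch. The abstract does not say how the two representations are combined, so everything here is an assumption made for illustration: fusion is taken to be concatenation of L2-normalized per-pixel features, labels are propagated from a reference frame through a top-k softmax over feature affinities, and the names `fuse_features`, `propagate_labels`, `weight`, `top_k`, and `temperature` are hypothetical. Random arrays stand in for the outputs of the two encoders.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def fuse_features(semantic, fine_grained, weight=1.0):
    """Fuse two per-pixel feature sets of shape (N, C1) and (N, C2).

    Assumption: fusion is channel-wise concatenation of L2-normalized
    features, with `weight` (hypothetical) scaling the fine-grained part.
    """
    return np.concatenate(
        [l2_normalize(semantic), weight * l2_normalize(fine_grained)], axis=-1
    )


def propagate_labels(ref_feats, ref_labels, tgt_feats, top_k=5, temperature=0.07):
    """Propagate soft labels from reference pixels to target pixels.

    ref_feats:  (N, C) fused features of the reference frame
    ref_labels: (N, K) one-hot or soft labels of the reference frame
    tgt_feats:  (M, C) fused features of the target frame
    returns:    (M, K) soft labels for the target frame
    """
    affinity = tgt_feats @ ref_feats.T  # (M, N) dot-product similarity
    k = min(top_k, affinity.shape[1])
    # Indices of the k most similar reference pixels for each target pixel.
    idx = np.argpartition(-affinity, k - 1, axis=1)[:, :k]
    top = np.take_along_axis(affinity, idx, axis=1) / temperature
    top = top - top.max(axis=1, keepdims=True)  # numerically stable softmax
    w = np.exp(top)
    w = w / w.sum(axis=1, keepdims=True)
    # Weighted average of the labels of the selected reference pixels.
    return np.einsum("mk,mkc->mc", w, ref_labels[idx])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    semantic = rng.normal(size=(6, 8))  # stand-in for image-level encoder output
    fine = rng.normal(size=(6, 4))      # stand-in for pixel-level encoder output
    fused = fuse_features(semantic, fine)
    labels = np.eye(6)                  # one label per reference pixel
    out = propagate_labels(fused, labels, fused, top_k=3, temperature=0.01)
    print(out.argmax(axis=1))
```

Normalizing each part before concatenation keeps either representation from dominating the affinity purely through feature magnitude; `weight` then makes the balance between the two an explicit choice.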