Learning a good representation for space-time correspondence is the key to a variety of computer vision tasks, including object bounding-box tracking and video object pixel segmentation. To learn generalizable representations for correspondence at scale, a variety of self-supervised pretext tasks have been proposed that explicitly perform object-level or patch-level similarity learning. Instead of following this literature, we propose to learn correspondence with Video Frame-level Similarity (VFS) learning, i.e., simply learning from comparing video frames. Our work is inspired by the recent success of image-level contrastive learning and similarity learning for visual recognition. Our hypothesis is that if the representation is good for recognition, the convolutional features should be able to find correspondences between similar objects or parts. Our experiments show the surprising result that VFS surpasses state-of-the-art self-supervised approaches on both OTB visual object tracking and DAVIS video object segmentation. We perform a detailed analysis of what matters in VFS and reveal new properties of image- and frame-level similarity learning. The project page is available at https://jerryxu.net/VFS.
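To make the core idea concrete, below is a minimal sketch of frame-level similarity learning, assuming a SimSiam-style objective (symmetric negative cosine similarity with a stop-gradient) applied to two frames sampled from the same video. The `FrameSimilarity` class, the projector/predictor sizes, and the use of a ResNet-50 backbone are illustrative assumptions, not the authors' exact training recipe.

```python
# A minimal sketch of Video Frame-level Similarity (VFS) learning,
# assuming a SimSiam-style objective. Names and hyperparameters here
# are illustrative placeholders, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models


class FrameSimilarity(nn.Module):
    def __init__(self, proj_dim: int = 256):
        super().__init__()
        backbone = models.resnet50()           # convolutional encoder
        feat_dim = backbone.fc.in_features     # 2048 for ResNet-50
        backbone.fc = nn.Identity()            # keep pooled conv features
        self.encoder = backbone
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
        )
        self.predictor = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(inplace=True),
            nn.Linear(proj_dim, proj_dim),
        )

    def forward(self, frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
        # frame_a, frame_b: two frames sampled from the same video,
        # each of shape (B, 3, H, W); they form a positive pair.
        z_a = self.projector(self.encoder(frame_a))
        z_b = self.projector(self.encoder(frame_b))
        p_a, p_b = self.predictor(z_a), self.predictor(z_b)
        # Symmetric loss; .detach() applies the stop-gradient to targets.
        return -0.5 * (
            F.cosine_similarity(p_a, z_b.detach(), dim=-1).mean()
            + F.cosine_similarity(p_b, z_a.detach(), dim=-1).mean()
        )
```

In this setup the only supervision signal is temporal co-occurrence: frames drawn from the same video are pulled together in feature space, and the learned convolutional features are then evaluated for correspondence tasks such as tracking and segmentation.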