Learning a good representation for space-time correspondence is the key to various computer vision tasks, including tracking object bounding boxes and performing video object pixel segmentation. To learn generalizable representations for correspondence at large scale, a variety of self-supervised pretext tasks have been proposed that explicitly perform object-level or patch-level similarity learning. Instead of following the previous literature, we propose to learn correspondence using Video Frame-level Similarity (VFS) learning, i.e., simply learning from comparing video frames. Our work is inspired by the recent success of image-level contrastive learning and similarity learning for visual recognition. Our hypothesis is that a representation that is good for recognition requires the convolutional features to find correspondence between similar objects or parts. Our experiments show the surprising result that VFS surpasses state-of-the-art self-supervised approaches on both OTB visual object tracking and DAVIS video object segmentation. We perform a detailed analysis of what matters in VFS and reveal new properties of image- and frame-level similarity learning. The project page with code is available at https://jerryxu.net/VFS
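To make the frame-level similarity idea concrete, below is a minimal sketch assuming an InfoNCE-style contrastive objective over two frames sampled from the same video; the ResNet-50 encoder, the projection head, the temperature, and all names here are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch of Video Frame-level Similarity (VFS) learning, assuming an
# InfoNCE-style contrastive objective over two frames of the same video.
# Encoder, projection head, and temperature are illustrative assumptions,
# not the paper's exact training recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision

class FrameSimilarityModel(nn.Module):
    def __init__(self, feat_dim=2048, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Identity()  # keep the 2048-d pooled conv features
        self.encoder = backbone
        # Hypothetical 2-layer projection head, as in common contrastive setups.
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim))

    def forward(self, frames):
        return self.projector(self.encoder(frames))

def frame_infonce(z_a, z_b, tau=0.07):
    """InfoNCE over a batch: two frames from the same video are positives;
    frames from the other videos in the batch serve as negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / tau                        # (B, B) similarities
    labels = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, labels)

# Usage: frame_a and frame_b hold two frames sampled from each of 8 videos.
model = FrameSimilarityModel()
frame_a, frame_b = torch.randn(8, 3, 224, 224), torch.randn(8, 3, 224, 224)
loss = frame_infonce(model(frame_a), model(frame_b))
loss.backward()
```

Under this view, no object-level or patch-level pretext task is required; the ability to find correspondence is expected to emerge in the intermediate convolutional features as a byproduct of frame-level similarity learning.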