While self-supervised learning has enabled effective representation learning in the absence of labels, video remains a relatively untapped source of supervision for vision. To address this, we propose Pixel-level Correspondence (PiCo), a method for dense contrastive learning from video. By tracking points with optical flow, we obtain a correspondence map that can be used to match local features at different points in time. We validate PiCo on standard benchmarks, where it outperforms self-supervised baselines on multiple dense prediction tasks without compromising performance on image classification.
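To make the correspondence idea concrete, the following is a minimal NumPy sketch, not the PiCo implementation. It assumes a precomputed dense flow field stored as per-pixel `(dx, dy)` displacements, a feature grid with a fixed `stride`, and an InfoNCE-style loss with a `temperature` parameter; the abstract does not specify any of these details. The function names `track_points` and `dense_info_nce` are hypothetical.

```python
import numpy as np


def track_points(flow, stride):
    """Follow each feature-cell centre along a dense flow field.

    flow: (H, W, 2) array of assumed (dx, dy) pixel displacements
          from frame A to frame B.
    Returns (src, dst): flat feature-grid indices in A and B for the
    points that remain inside the frame after being displaced.
    """
    H, W, _ = flow.shape
    gh, gw = H // stride, W // stride
    ys, xs = np.meshgrid(np.arange(gh), np.arange(gw), indexing="ij")
    cy = ys * stride + stride // 2  # pixel row of each cell centre
    cx = xs * stride + stride // 2  # pixel column of each cell centre
    ty = cy + flow[cy, cx, 1]       # tracked row in frame B
    tx = cx + flow[cy, cx, 0]       # tracked column in frame B
    inside = (ty >= 0) & (ty < gh * stride) & (tx >= 0) & (tx < gw * stride)
    dst_y = np.floor(ty / stride).astype(int)
    dst_x = np.floor(tx / stride).astype(int)
    src = (ys * gw + xs)[inside]
    dst = (dst_y * gw + dst_x)[inside]
    return src, dst


def dense_info_nce(feat_a, feat_b, src, dst, temperature=0.1):
    """InfoNCE-style loss over local features (an assumed loss form).

    feat_a, feat_b: (N, C) local features of frames A and B, N = gh * gw.
    For each tracked point, the matched cell in B is the positive and
    every other cell in B serves as a negative.
    """
    a = feat_a[src]
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = feat_b / np.linalg.norm(feat_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # (M, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(len(src)), dst].mean()


# Toy usage: an 8x8 frame pair where everything moves 4 pixels to the right.
flow = np.zeros((8, 8, 2))
flow[..., 0] = 4.0
src, dst = track_points(flow, stride=4)
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(4, 16))
feat_b = rng.normal(size=(4, 16))
loss = dense_info_nce(feat_a, feat_b, src, dst)
```

Points that leave the frame are simply dropped here; a practical system would likely also filter unreliable flow (for example with a forward-backward consistency check), which this sketch omits.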