This paper proposes a simple self-supervised approach for learning a representation for visual correspondence from raw video. We cast correspondence as prediction of links in a space-time graph constructed from video. In this graph, the nodes are patches sampled from each frame, and nodes adjacent in time can share a directed edge. We learn a representation in which pairwise similarity defines the transition probability of a random walk, so that long-range correspondence is computed as a walk along the graph. We optimize the representation to place high probability along paths of similarity. Targets for learning are formed without supervision, by cycle-consistency: the objective is to maximize the likelihood of returning to the initial node when walking along a graph constructed from a palindrome of frames. Thus, a single path-level constraint implicitly supervises chains of intermediate comparisons. When used as a similarity metric without adaptation, the learned representation outperforms the self-supervised state of the art on label propagation tasks involving objects, semantic parts, and pose. Moreover, we demonstrate that a technique we call edge dropout, as well as self-supervised adaptation at test time, further improves transfer for object-centric correspondence.
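The cycle-consistency objective described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`transition`, `cycle_walk_loss`), the use of cosine similarity, and the softmax temperature of 0.07 are choices made here for the example, and the node embeddings are taken as given rather than produced by a learned encoder. Edge dropout and test-time adaptation are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def transition(src, dst, temperature=0.07):
    """Row-stochastic matrix: probability of stepping from each node in
    `src` to each node in `dst`. Cosine similarity and the temperature
    value are assumptions made for this sketch."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    dst = dst / np.linalg.norm(dst, axis=1, keepdims=True)
    return softmax(src @ dst.T / temperature, axis=1)


def cycle_walk_loss(frames, temperature=0.07):
    """frames: array of shape (T, N, D) holding N node embeddings per frame.

    Builds the palindrome 0, 1, ..., T-1, ..., 1, 0, chains the per-step
    transition matrices into one long-range walk, and returns the mean
    negative log-likelihood of each walker ending on its starting node.
    """
    palindrome = list(frames) + list(frames[-2::-1])
    n = frames.shape[1]
    walk = np.eye(n)
    for src, dst in zip(palindrome[:-1], palindrome[1:]):
        walk = walk @ transition(src, dst, temperature)
    return_prob = np.diag(walk)  # probability of returning to the start node
    loss = -np.log(return_prob + 1e-12).mean()
    return loss, walk


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames = rng.normal(size=(3, 5, 8))  # 3 frames, 5 patches, 8-dim features
    loss, walk = cycle_walk_loss(frames)
    print("loss:", loss)
    print("rows sum to one:", np.allclose(walk.sum(axis=1), 1.0))
```

Because the loss is defined only on the end points of the walk, every intermediate transition matrix receives gradient through the matrix product when the embeddings come from a differentiable encoder; this is the sense in which a single path-level constraint supervises the chain of intermediate comparisons.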