This paper presents a simple yet effective approach to modeling space-time correspondences in the context of video object segmentation. Unlike most existing approaches, we establish correspondences directly between frames without re-encoding the mask features for every object, leading to a highly efficient and robust framework. With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion. We cast the aggregation process as a voting problem and find that the existing inner-product affinity leads to poor use of memory, with a small (fixed) subset of memory nodes dominating the votes regardless of the query. In light of this phenomenon, we propose using the negative squared Euclidean distance instead to compute the affinities. We validate that every memory node now has a chance to contribute, and show experimentally that such diversified voting is beneficial to both memory efficiency and inference accuracy. The synergy of correspondence networks and diversified voting works exceedingly well, achieving new state-of-the-art results on both the DAVIS and YouTubeVOS datasets while running significantly faster, at 20+ FPS for multiple objects, without bells and whistles.
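To illustrate the voting view of aggregation, the following is a minimal sketch rather than the paper's implementation. It assumes that affinities are normalized with a softmax over memory nodes (the abstract does not state the normalization), and all names (`dot_affinity`, `neg_sq_l2_affinity`, `vote_weights`, `top_voters`) and the toy keys are hypothetical. It contrasts the inner-product affinity with the negative squared Euclidean distance on a memory that contains one large-norm key.

```python
import math


def dot_affinity(q, k):
    """Inner-product affinity between a query key and a memory key."""
    return sum(qi * ki for qi, ki in zip(q, k))


def neg_sq_l2_affinity(q, k):
    """Negative squared Euclidean distance between a query key and a memory key."""
    return -sum((qi - ki) ** 2 for qi, ki in zip(q, k))


def softmax(scores):
    """Numerically stable softmax over a list of scores (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def vote_weights(queries, memory_keys, affinity):
    """For each query, return the normalized vote of every memory node."""
    return [softmax([affinity(q, k) for k in memory_keys]) for q in queries]


def top_voters(weights):
    """Index of the memory node that contributes most to each query."""
    return [max(range(len(w)), key=w.__getitem__) for w in weights]


# Toy memory: node 0 has a large norm; nodes 1 and 2 match the queries exactly.
memory_keys = [(5.0, 5.0), (1.0, 0.0), (0.0, 1.0)]
queries = [(1.0, 0.0), (0.0, 1.0)]

dot_w = vote_weights(queries, memory_keys, dot_affinity)
l2_w = vote_weights(queries, memory_keys, neg_sq_l2_affinity)

print("inner product, top voter per query:", top_voters(dot_w))  # [0, 0]
print("negative squared L2, top voter per query:", top_voters(l2_w))  # [1, 2]
```

In this toy setting the large-norm node wins every vote under the inner product, whatever the query, while under the negative squared Euclidean distance each query is served by its nearest memory node. This mirrors, on a very small scale, the dominance effect described above; it says nothing about how the effect behaves on real features.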