Existing state-of-the-art methods for Video Object Segmentation (VOS) learn low-level pixel-to-pixel correspondences between frames to propagate object masks across video. This requires a large amount of densely annotated video data, which is costly to annotate, and largely redundant since frames within a video are highly correlated. In light of this, we propose HODOR: a novel method that tackles VOS by effectively leveraging annotated static images for understanding object appearance and scene context. We encode object instances and scene information from an image frame into robust high-level descriptors which can then be used to re-segment those objects in different frames. As a result, HODOR achieves state-of-the-art performance on the DAVIS and YouTube-VOS benchmarks compared to existing methods trained without video annotations. Without any architectural modification, HODOR can also learn from video context around single annotated video frames by utilizing cyclic consistency, whereas other methods rely on dense, temporally consistent annotations.
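To make the core idea concrete, the toy sketch below illustrates the "encode objects into descriptors, then re-segment another frame" flow described above. It is only a minimal illustration under assumed simplifications: the functions encode_descriptors and resegment, the mask-weighted average pooling, and the cosine-similarity soft assignment are all placeholders chosen for brevity, not HODOR's actual attention-based architecture.

```python
import torch
import torch.nn.functional as F

def encode_descriptors(features, masks):
    """Pool per-pixel features inside each mask into one descriptor per object.
    features: (C, H, W) frame features; masks: (N, H, W) object masks."""
    C, H, W = features.shape
    feats = features.view(C, H * W)                          # (C, HW)
    m = masks.view(masks.shape[0], H * W).float()            # (N, HW)
    weights = m / m.sum(dim=1, keepdim=True).clamp(min=1e-6)  # normalize per object
    return weights @ feats.t()                               # (N, C) descriptors

def resegment(descriptors, features, temperature=0.1):
    """Re-segment a different frame: assign each pixel to the best-matching
    descriptor via a softmax over cosine similarities."""
    C, H, W = features.shape
    feats = F.normalize(features.view(C, H * W), dim=0)      # unit-norm pixel features
    desc = F.normalize(descriptors, dim=1)                   # unit-norm descriptors
    logits = desc @ feats / temperature                      # (N, HW) similarities
    return logits.softmax(dim=0).view(-1, H, W)              # soft masks per object

# Toy usage: descriptors built from frame t re-segment frame t+1.
feat_t   = torch.randn(64, 32, 32)
feat_t1  = torch.randn(64, 32, 32)
masks_t  = (torch.rand(3, 32, 32) > 0.7).float()             # 3 toy object masks
desc     = encode_descriptors(feat_t, masks_t)
masks_t1 = resegment(desc, feat_t1)
print(masks_t1.shape)  # torch.Size([3, 32, 32])
```

In this simplified view, training from static images is possible because the descriptors only need to capture object appearance well enough to re-localize the same objects under augmentation or in nearby frames, rather than dense pixel-to-pixel correspondences.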