In this work, we present a new computer vision task named video object of interest segmentation (VOIS). Given a video and a target image of interest, our objective is to simultaneously segment and track all objects in the video that are relevant to the target image. This problem combines the traditional video object segmentation task with an additional target image indicating the content the user cares about. Since no existing dataset is well suited to this new task, we construct a large-scale dataset called LiveVideos, which contains 2418 pairs of target images and live videos with instance-level annotations. In addition, we propose a transformer-based method for this task. We revisit Swin Transformer and design a dual-path structure to fuse video and image features. A transformer decoder is then employed to generate object proposals for segmentation and tracking from the fused features. Extensive experiments on the LiveVideos dataset demonstrate the superiority of our proposed method.
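The dual-path fusion of video and image features can be loosely pictured as bidirectional cross-attention between the two token streams. The sketch below is a minimal illustration under assumed shapes and a simple residual fusion rule; the function names, token counts, and the single-head attention are all our assumptions, not the paper's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    # single-head scaled dot-product attention: queries from one
    # path attend to keys/values from the other path
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ v

def dual_path_fuse(video_tokens, image_tokens):
    # illustrative dual-path fusion: each stream attends to the other,
    # with a residual connection back to its own features
    video_fused = video_tokens + cross_attention(video_tokens, image_tokens, image_tokens)
    image_fused = image_tokens + cross_attention(image_tokens, video_tokens, video_tokens)
    return video_fused, image_fused

rng = np.random.default_rng(0)
video = rng.normal(size=(16, 64))   # 16 video patch tokens, dim 64 (assumed)
image = rng.normal(size=(4, 64))    # 4 target-image tokens, dim 64 (assumed)
fv, fi = dual_path_fuse(video, image)
print(fv.shape, fi.shape)  # (16, 64) (4, 64)
```

In an actual model, the fused video tokens would then serve as memory for a transformer decoder whose learned queries produce per-object proposals for segmentation and tracking.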