Previous works on video object segmentation (VOS) are trained on densely annotated videos. However, acquiring pixel-level annotations is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos: we require only two labeled frames per training video, while largely sustaining performance. We term this novel training paradigm two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to most existing frameworks. We first pre-train a VOS model on the sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. We then use the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data, without any restriction on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. Using only 7.3% and 2.9% of the labeled data in the YouTube-VOS and DAVIS benchmarks, our approach achieves results comparable to counterparts trained on the fully labeled sets. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation.
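The three-phase pipeline described in the abstract (semi-supervised pre-training, pseudo-label bank construction, unrestricted retraining) can be summarized in a few lines of PyTorch. The sketch below is illustrative only: `ToyVOSModel` and the random data stand in for a real memory-based VOS network and dataset, the phase-1 semi-supervised step is reduced to supervised training on the two labeled frames, and all names are hypothetical; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of the two-shot VOS training pipeline (assumed names,
# not the released implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVOSModel(nn.Module):
    """Stand-in segmentation net: predicts a binary mask per frame."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1),
        )

    def forward(self, frame):          # frame: (B, 3, H, W)
        return self.net(frame)         # logits: (B, 1, H, W)

def make_two_shot_videos(n=4, t=5, hw=32):
    """Random videos; only two frames per video carry ground-truth masks."""
    videos = torch.rand(n, t, 3, hw, hw)
    masks = (torch.rand(n, t, 1, hw, hw) > 0.5).float()
    labeled = {0, t // 2}              # the "two shots" per video
    return videos, masks, labeled

def train(model, videos, targets, frames, epochs=2):
    """Supervised training on the given frame indices of every video."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for v in range(videos.size(0)):
            for t in frames:
                loss = F.binary_cross_entropy_with_logits(
                    model(videos[v, t:t + 1]), targets[v, t:t + 1])
                opt.zero_grad()
                loss.backward()
                opt.step()

videos, masks, labeled = make_two_shot_videos()

# Phase 1: pre-train on the sparse labels only (the paper does this
# semi-supervised, always sampling a labeled frame as the first frame).
model = ToyVOSModel()
train(model, videos, masks, frames=sorted(labeled))

# Phase 2: run the pre-trained model over all unlabeled frames and store
# thresholded predictions in a pseudo-label bank; labeled entries keep
# their ground-truth masks.
T = videos.size(1)
pseudo_bank = masks.clone()
with torch.no_grad():
    for v in range(videos.size(0)):
        for t in range(T):
            if t not in labeled:
                probs = torch.sigmoid(model(videos[v, t:t + 1]))[0]
                pseudo_bank[v, t] = (probs > 0.5).float()

# Phase 3: retrain from scratch on labeled + pseudo-labeled frames,
# with no restriction on which frame comes first.
model = ToyVOSModel()
train(model, videos, pseudo_bank, frames=range(T))
```

In this toy form the pseudo-label bank is just a tensor holding a mask per frame; the key design point carried over from the abstract is that phase 3 treats pseudo labels and ground truth uniformly, so retraining no longer needs a labeled first frame.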