Video Object Segmentation (VOS) has been addressed by a variety of fully-supervised and self-supervised approaches. While fully-supervised methods demonstrate excellent results, self-supervised ones, which do not rely on pixel-level ground truth, attract much attention; however, they still exhibit a significant performance gap. Box-level annotations offer a balanced compromise between labeling effort and result quality for image segmentation, but they have not yet been exploited in the video domain. In this work, we propose a box-supervised video object segmentation proposal network that takes advantage of intrinsic video properties. Our method incorporates object motion in two ways: first, motion is computed using a bidirectional temporal difference and a novel bounding-box-guided motion compensation; second, we introduce a novel motion-aware affinity loss that encourages the network to predict positive pixel pairs if they share similar motion and color. The proposed method outperforms the state-of-the-art self-supervised benchmark by 16.4% and 6.9% $\mathcal{J}\&\mathcal{F}$ score on the DAVIS and YouTube-VOS datasets, respectively, and surpasses the majority of fully supervised methods, without imposing restrictions on the network architecture. We provide extensive tests and ablations on both datasets, demonstrating the robustness of our method.
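To make the two motion components concrete, the sketch below illustrates one plausible reading of the abstract: a bidirectional temporal difference as the elementwise minimum of forward and backward frame differences, a box-guided compensation that subtracts background motion estimated outside the box, and an affinity loss over adjacent pixel pairs gated by color and motion similarity. All function names, the neighborhood choice, and the thresholds `tau_c`/`tau_m` are hypothetical illustrations; the paper's exact formulation is not given in the abstract.

```python
import torch

def bidirectional_temporal_difference(prev_frame, cur_frame, next_frame):
    # Motion cue as the elementwise minimum of forward and backward frame
    # differences (one common bidirectional-difference heuristic; the
    # paper's exact operator may differ). Frames: (B, C, H, W) in [0, 1].
    fwd = (cur_frame - prev_frame).abs().mean(dim=1, keepdim=True)
    bwd = (next_frame - cur_frame).abs().mean(dim=1, keepdim=True)
    return torch.minimum(fwd, bwd)  # (B, 1, H, W)

def box_guided_compensation(motion, boxes):
    # Hypothetical box-guided compensation: per image, treat the median
    # motion outside the box as camera/background motion and subtract it,
    # keeping only object-relative motion. boxes: list of (x1, y1, x2, y2).
    out = motion.clone()
    for b, (x1, y1, x2, y2) in enumerate(boxes):
        bg_mask = torch.ones_like(motion[b, 0], dtype=torch.bool)
        bg_mask[y1:y2, x1:x2] = False          # background = outside the box
        bg = motion[b, 0][bg_mask].median()
        out[b, 0] = (motion[b, 0] - bg).clamp(min=0)
    return out

def motion_aware_affinity_loss(logits, color, motion, tau_c=0.05, tau_m=0.05):
    # Encourage adjacent pixels with similar color AND motion to receive
    # the same prediction. Only horizontal neighbors are used for brevity;
    # a real implementation would use a larger pairwise neighborhood.
    p = torch.sigmoid(logits)                  # (B, 1, H, W) foreground prob.
    # Probability that two horizontally adjacent pixels agree on a label.
    pair = p[..., :, :-1] * p[..., :, 1:] + (1 - p[..., :, :-1]) * (1 - p[..., :, 1:])
    sim_c = (color[..., :, :-1] - color[..., :, 1:]).abs().mean(1, keepdim=True) < tau_c
    sim_m = (motion[..., :, :-1] - motion[..., :, 1:]).abs() < tau_m
    pos = (sim_c & sim_m).float()              # positive pairs share color + motion
    return -(pos * torch.log(pair.clamp(min=1e-6))).sum() / pos.sum().clamp(min=1)
```

Gating the affinity term on both cues is what distinguishes it from a purely color-based pairwise loss: pixels that look alike but move differently (e.g. an object against a similarly colored background) are not pulled toward the same label.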