We propose an efficient inference framework for semi-supervised video object segmentation that exploits the temporal redundancy of video. Our method performs inference on selected keyframes and makes predictions for the remaining frames via propagation based on motion vectors and residuals from the compressed video bitstream. Specifically, we propose a new motion vector-based warping method that propagates segmentation masks from keyframes to other frames in a multi-reference manner. Additionally, we propose a residual-based refinement module that corrects and adds detail to the block-wise propagated segmentation masks. Our approach is flexible and can be added on top of existing video object segmentation algorithms. Using STM with top-k filtering as our base model, we achieve highly competitive results on DAVIS16 and YouTube-VOS with substantial speedups of up to 4.9X and little loss in accuracy.
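To make the propagation step concrete, here is a minimal sketch of block-wise mask warping driven by bitstream motion vectors. This is an illustrative assumption, not the paper's implementation: the function name, fixed block size, and single-reference simplification are hypothetical, whereas the proposed method propagates from multiple references and additionally refines the result with residuals.

```python
import numpy as np

def warp_mask_with_motion_vectors(ref_mask, motion_vectors, block_size=16):
    """Propagate a keyframe segmentation mask to a non-keyframe using
    block-wise motion vectors (hypothetical single-reference sketch).

    ref_mask:        (H, W) integer mask predicted on the reference keyframe.
    motion_vectors:  (H // block_size, W // block_size, 2) array of (dy, dx)
                     offsets pointing from each block of the current frame
                     back to its matching location in the reference frame.
    """
    H, W = ref_mask.shape
    warped = np.zeros_like(ref_mask)
    for by in range(0, H, block_size):
        for bx in range(0, W, block_size):
            dy, dx = motion_vectors[by // block_size, bx // block_size]
            # Source block position in the reference frame, clamped to bounds.
            sy = int(np.clip(by + dy, 0, H - block_size))
            sx = int(np.clip(bx + dx, 0, W - block_size))
            # Copy the matched reference block into the current frame's mask.
            warped[by:by + block_size, bx:bx + block_size] = \
                ref_mask[sy:sy + block_size, sx:sx + block_size]
    return warped
```

Because the copy is block-wise, the warped mask is coarse at object boundaries; the residual-based refinement module described above is what restores that detail in the proposed pipeline.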