Video object segmentation, which aims to segment the foreground objects given the annotation of the first frame, has been attracting increasing attention. Many state-of-the-art approaches achieve strong performance by relying on online model updating or mask-propagation techniques. However, most online models incur high computational cost due to model fine-tuning during inference, while most mask-propagation based models are faster but deliver relatively low performance because they fail to adapt to object appearance variation. In this paper, we aim to design a new model that strikes a good balance between speed and performance. We propose a model, called NPMCA-net, which directly localizes foreground objects based on mask propagation and a non-local technique that matches pixels between reference and target frames. Since we bring in information from both the first and the previous frames, our network is robust to large object appearance variation and can better adapt to occlusions. Extensive experiments show that our approach achieves new state-of-the-art performance while maintaining a fast speed (86.5% IoU on DAVIS-2016 and 72.2% IoU on DAVIS-2017, at 0.11 s per frame) under the same level of comparison. Source code is available at https://github.com/siyueyu/NPMCA-net.
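The core matching step described above can be sketched as follows. This is a minimal, hypothetical simplification of non-local pixel matching for mask propagation, not the paper's actual implementation: every target-frame pixel computes an affinity with every reference-frame pixel, the affinities are normalized by a softmax, and the reference mask is aggregated by these attention weights. All function and variable names here are our own assumptions.

```python
import numpy as np

def nonlocal_mask_propagation(ref_feat, ref_mask, tgt_feat):
    """Propagate a reference mask to a target frame via non-local
    pixel matching (simplified sketch; shapes and names are assumptions).

    ref_feat: (C, H, W) features of the reference frame
    ref_mask: (H, W)    soft foreground mask of the reference frame
    tgt_feat: (C, H, W) features of the target frame
    """
    C, H, W = ref_feat.shape
    ref = ref_feat.reshape(C, -1)           # (C, HW)
    tgt = tgt_feat.reshape(C, -1)           # (C, HW)
    # Affinity between every target pixel and every reference pixel.
    affinity = tgt.T @ ref                  # (HW_tgt, HW_ref)
    # Softmax over reference pixels turns affinities into attention weights.
    affinity -= affinity.max(axis=1, keepdims=True)
    weights = np.exp(affinity)
    weights /= weights.sum(axis=1, keepdims=True)
    # Each target pixel aggregates the reference mask by its attention weights,
    # yielding a soft foreground estimate for the target frame.
    propagated = weights @ ref_mask.reshape(-1)  # (HW_tgt,)
    return propagated.reshape(H, W)
```

Because the attention weights form a convex combination over reference pixels, the propagated mask stays within the value range of the input mask. In NPMCA-net this matching is performed against both the first and the previous frame, which is what gives the model its robustness to appearance change and occlusion.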