Instance-level video object segmentation is an important technique for video editing and compression. To capture temporal coherence, in this paper we develop MaskRNN, a recurrent neural net approach which, in each frame, fuses the outputs of two deep nets for each object instance -- a binary segmentation net providing a mask and a localization net providing a bounding box. Due to the recurrent component and the localization component, our method is able to take advantage of the long-term temporal structure of the video data as well as to reject outliers. We validate the proposed algorithm on three challenging benchmark datasets, the DAVIS-2016 dataset, the DAVIS-2017 dataset, and the SegTrack v2 dataset, achieving state-of-the-art performance on all of them.
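To make the described fusion concrete, the following is a minimal illustrative sketch, not the authors' implementation: per frame, a segmentation head predicts a soft mask, a localization head predicts a normalized bounding box, the mask is restricted to the box (rejecting spatial outliers), and a convolutional recurrent state carries temporal context across frames. All module architectures, names, and shapes here are assumptions chosen for brevity.

```python
# Sketch of MaskRNN-style per-frame fusion (illustrative assumptions only).
import torch
import torch.nn as nn

class MaskRNNSketch(nn.Module):
    def __init__(self, hidden_ch=16):
        super().__init__()
        # Binary segmentation head: frame + recurrent state -> mask logits.
        self.seg_net = nn.Conv2d(3 + hidden_ch, 1, kernel_size=3, padding=1)
        # Localization head: frame -> box (x1, y1, x2, y2), normalized to [0, 1].
        self.loc_net = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, 4), nn.Sigmoid()
        )
        # Recurrent update of the hidden state from the fused mask.
        self.rnn = nn.Conv2d(1 + hidden_ch, hidden_ch, kernel_size=3, padding=1)
        self.hidden_ch = hidden_ch

    def box_to_mask(self, box, h, w):
        # Rasterize a normalized box into a {0, 1} spatial mask.
        ys = torch.linspace(0, 1, h).view(h, 1)
        xs = torch.linspace(0, 1, w).view(1, w)
        x1, y1, x2, y2 = box.unbind(-1)
        inside = (xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)
        return inside.float().view(1, 1, h, w)

    def forward(self, frames):
        # frames: (T, 3, H, W) video clip for a single object instance.
        T, _, H, W = frames.shape
        state = torch.zeros(1, self.hidden_ch, H, W)
        masks = []
        for t in range(T):
            frame = frames[t:t + 1]                       # (1, 3, H, W)
            logits = self.seg_net(torch.cat([frame, state], dim=1))
            box = self.loc_net(frame)[0]                  # (4,)
            # Fuse: zero out mask probabilities outside the predicted box,
            # which rejects segmentation outliers far from the object.
            mask = torch.sigmoid(logits) * self.box_to_mask(box, H, W)
            state = torch.tanh(self.rnn(torch.cat([mask, state], dim=1)))
            masks.append(mask)
        return torch.stack(masks)  # (T, 1, 1, H, W) soft masks per frame

clip = torch.rand(5, 3, 64, 64)          # toy 5-frame clip
print(MaskRNNSketch()(clip).shape)       # torch.Size([5, 1, 1, 64, 64])
```

The key design point the sketch mirrors is that the bounding box acts as a hard spatial gate on the mask, while the recurrent state lets evidence from earlier frames influence the current segmentation.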