In this paper we introduce SiamMask, a framework that performs both visual object tracking and video object segmentation in real time with the same simple method. We improve the offline training procedure of popular fully-convolutional Siamese approaches by augmenting their losses with a binary segmentation task. Once offline training is completed, SiamMask requires only a single bounding box for initialization and can simultaneously carry out visual object tracking and segmentation at high frame rates. Moreover, we show that the framework can be extended to handle multiple object tracking and segmentation simply by re-using the multi-task model in a cascaded fashion. Experimental results show that our approach is highly efficient, running at around 55 frames per second. It yields real-time state-of-the-art results on visual object tracking benchmarks, while at the same time demonstrating competitive performance at high speed on video object segmentation benchmarks.
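The key training-time change described above is to augment the Siamese tracking loss with a binary segmentation term. A minimal sketch of this multi-task combination, using hypothetical function and parameter names (the actual SiamMask loss operates on per-pixel logits from a dedicated mask branch):

```python
import math

def binary_seg_loss(logits, labels):
    """Per-pixel binary logistic loss, averaged over pixels.

    logits: raw scores from a (hypothetical) mask branch.
    labels: ground-truth pixel labels in {-1, +1}.
    """
    total = 0.0
    for logit, y in zip(logits, labels):
        # logistic loss: log(1 + exp(-y * logit))
        total += math.log1p(math.exp(-y * logit))
    return total / len(logits)

def multitask_loss(track_loss, seg_loss, lam=1.0):
    # Augment the existing tracking loss with the segmentation term,
    # weighted by a trade-off hyperparameter `lam` (assumed name).
    return track_loss + lam * seg_loss
```

In practice the tracking term would itself be the usual Siamese similarity/regression loss, and `lam` balances the two tasks during offline training; at inference only a bounding-box initialization is needed, as the abstract notes.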