In Video Instance Segmentation (VIS), current approaches either focus on the quality of the results, taking the whole video as input and processing it offline, or on speed, handling the video frame by frame at the cost of competitive performance. In this work, we propose an online method whose performance is on par with its offline counterparts. We introduce a message-passing graph neural network that encodes objects and relates them through time. We additionally propose a novel module that fuses features from the feature pyramid network through residual connections. Our model, trained end-to-end, achieves state-of-the-art performance on the YouTube-VIS dataset among online methods. Further experiments on DAVIS demonstrate the generalization capability of our model to the video object segmentation task. Code is available at: \url{https://github.com/caganselim/TLTM}
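The abstract names two architectural components: a message-passing graph neural network that relates object representations across frames, and a residual fusion of feature pyramid network (FPN) levels. Below is a minimal PyTorch sketch of what such components could look like; the module names, dimensions, message function, and fusion order are illustrative assumptions, not the released TLTM implementation.

```python
# Hypothetical sketch of the two components named in the abstract.
# All design details here are assumptions for illustration only.
import torch
import torch.nn as nn

class TemporalMessagePassing(nn.Module):
    """One round of message passing between object nodes of two frames.

    Each object in frame t receives messages from every object in frame t-1;
    messages are aggregated by mean and added via a residual connection.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.update = nn.Linear(dim, dim)

    def forward(self, prev_nodes: torch.Tensor, curr_nodes: torch.Tensor) -> torch.Tensor:
        # prev_nodes: (M, dim) object embeddings from frame t-1
        # curr_nodes: (N, dim) object embeddings from frame t
        M, N = prev_nodes.size(0), curr_nodes.size(0)
        # Build all (current, previous) pairs and compute pairwise messages.
        pairs = torch.cat(
            [curr_nodes.unsqueeze(1).expand(N, M, -1),
             prev_nodes.unsqueeze(0).expand(N, M, -1)], dim=-1)
        msgs = self.message(pairs).mean(dim=1)  # aggregate over frame t-1 objects
        return curr_nodes + self.update(msgs)   # residual node update

class ResidualFPNFusion(nn.Module):
    """Fuses FPN levels into one map with residual connections (assumed design)."""
    def __init__(self, channels: int, num_levels: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=1)
            for _ in range(num_levels - 1))

    def forward(self, feats: list) -> torch.Tensor:
        # feats: FPN maps ordered coarse-to-fine, each of shape (B, C, H_i, W_i)
        out = feats[0]
        for feat, conv in zip(feats[1:], self.convs):
            # Upsample the running map to the next level's resolution,
            # then add it back through a residual connection.
            out = nn.functional.interpolate(out, size=feat.shape[-2:], mode="nearest")
            out = feat + conv(out)
        return out
```

As a usage sketch, one `TemporalMessagePassing` step would be applied between the object embeddings of consecutive frames during online inference, while `ResidualFPNFusion` would collapse the backbone's pyramid into a single feature map before mask prediction; the actual wiring in the paper may differ.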