Video Instance Segmentation (VIS) is a new and inherently multi-task problem that aims to detect, segment, and track each instance in a video sequence. Existing approaches mainly rely on either single-frame features or single-scale features from multiple frames, so they ignore either temporal information or multi-scale information. To incorporate both, we propose a Temporal Pyramid Routing (TPR) strategy that conditionally aligns and aggregates, at the pixel level, the feature pyramids of two adjacent frames. Specifically, TPR contains two novel components: Dynamic Aligned Cell Routing (DACR), which aligns and gates pyramid features across the temporal dimension, and Cross Pyramid Routing (CPR), which transfers the temporally aggregated features across the scale dimension. Moreover, our approach is a lightweight, plug-and-play module that can be easily applied to existing instance segmentation methods. Extensive experiments on three datasets, YouTube-VIS 2019, YouTube-VIS 2021, and Cityscapes-VPS, demonstrate the effectiveness and efficiency of the proposed approach on several state-of-the-art video instance and panoptic segmentation methods. Code will be publicly available at \url{https://github.com/lxtGH/TemporalPyramidRouting}.
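To make the two-stage idea in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a hand-written cosine-similarity gate as a stand-in for the learned gating in DACR, omits the alignment step entirely, and uses nearest-neighbour upsampling with a top-down sum as a stand-in for CPR. All function names (`gated_temporal_fuse`, `cross_scale_transfer`, `temporal_pyramid_sketch`) are hypothetical.

```python
import numpy as np


def sigmoid(x):
    """Logistic function, used here to squash similarities into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def gated_temporal_fuse(cur, ref, eps=1e-6):
    """Fuse one pyramid level of the current frame with the reference frame.

    cur, ref: arrays of shape (C, H, W) at the same scale.
    The per-pixel gate is derived from channel-wise cosine similarity.
    This is an assumed stand-in for learned gating; no alignment is done.
    """
    num = (cur * ref).sum(axis=0)
    den = np.linalg.norm(cur, axis=0) * np.linalg.norm(ref, axis=0) + eps
    gate = sigmoid(num / den)            # shape (H, W), values in (0, 1)
    return cur + gate[None] * ref        # gated pixel-level aggregation


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def cross_scale_transfer(levels):
    """Propagate features top-down across scales.

    levels: list of (C, H, W) arrays ordered fine to coarse, where each
    level has half the spatial size of the previous one (an assumption).
    """
    out = list(levels)
    for i in range(len(out) - 2, -1, -1):
        out[i] = out[i] + upsample2x(out[i + 1])
    return out


def temporal_pyramid_sketch(pyr_cur, pyr_ref):
    """Temporal fusion per level, then cross-scale transfer."""
    fused = [gated_temporal_fuse(c, r) for c, r in zip(pyr_cur, pyr_ref)]
    return cross_scale_transfer(fused)
```

The output pyramid has the same shapes as the input pyramid, which is what would let a module of this kind sit between a backbone's feature pyramid and an existing segmentation head without changing either.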