Video Instance Segmentation (VIS) jointly tackles multi-object detection, tracking, and segmentation in video sequences. In the past, VIS methods mirrored the fragmentation of these subtasks in their architectural design, thereby missing out on a joint solution. Transformers recently made it possible to cast the entire VIS task as a single set-prediction problem. Nevertheless, the quadratic complexity of existing Transformer-based methods requires long training times and high memory, and restricts processing to low-resolution, single-scale feature maps. Deformable attention provides a more efficient alternative, but its application to the temporal domain or the segmentation task has not yet been explored. In this work, we present Deformable VIS (DeVIS), a VIS method which capitalizes on the efficiency and performance of deformable Transformers. To reason about all VIS subtasks jointly over multiple frames, we present temporal multi-scale deformable attention with instance-aware object queries. We further introduce a new image and video instance mask head with multi-scale features, and perform near-online video processing with multi-cue clip tracking. DeVIS reduces memory as well as training time requirements, and achieves state-of-the-art results on the YouTube-VIS 2021 benchmark, as well as the challenging OVIS dataset. Code is available at https://github.com/acaelles97/DeVIS.
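To illustrate the core idea behind temporal deformable attention, the following is a minimal, simplified sketch, not the authors' implementation: each object query samples a small, learned set of points from the feature maps of every frame in the clip instead of attending densely over all spatio-temporal locations. For brevity it is single-scale, assumes the clip length equals `n_frames`, and all module and parameter names (`TemporalDeformableAttention`, `n_points`, `ref_points`) are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalDeformableAttention(nn.Module):
    """Simplified single-scale sketch of deformable attention over a clip."""

    def __init__(self, d_model=256, n_heads=8, n_frames=4, n_points=4):
        super().__init__()
        self.n_heads, self.n_frames, self.n_points = n_heads, n_frames, n_points
        self.head_dim = d_model // n_heads
        # Per query: sampling offsets and attention weights for every
        # (head, frame, point) combination.
        self.sampling_offsets = nn.Linear(d_model, n_heads * n_frames * n_points * 2)
        self.attention_weights = nn.Linear(d_model, n_heads * n_frames * n_points)
        self.value_proj = nn.Linear(d_model, d_model)
        self.output_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, frame_feats):
        """
        queries:     (B, Nq, C)        object queries
        ref_points:  (B, Nq, 2)        normalized (x, y) reference points in [0, 1]
        frame_feats: (B, T, C, H, W)   per-frame feature maps of the clip (T == n_frames)
        """
        B, Nq, C = queries.shape
        T, H, W = frame_feats.shape[1], frame_feats.shape[3], frame_feats.shape[4]

        # Project values and split into heads: (B*T*heads, head_dim, H, W).
        v = self.value_proj(frame_feats.permute(0, 1, 3, 4, 2))        # B,T,H,W,C
        v = v.view(B, T, H, W, self.n_heads, self.head_dim)
        v = v.permute(0, 1, 4, 5, 2, 3).reshape(B * T * self.n_heads,
                                                self.head_dim, H, W)

        # Sampling locations = shared reference point + per-frame learned offsets.
        offsets = self.sampling_offsets(queries).view(
            B, Nq, self.n_heads, T, self.n_points, 2)
        norm = torch.tensor([W, H], dtype=queries.dtype, device=queries.device)
        locs = ref_points[:, :, None, None, None, :] + offsets / norm
        grid = 2 * locs - 1                                            # to [-1, 1]
        grid = grid.permute(0, 3, 2, 1, 4, 5).reshape(
            B * T * self.n_heads, Nq, self.n_points, 2)

        # Bilinearly sample only n_points locations per frame and head.
        sampled = F.grid_sample(v, grid, align_corners=False)          # (B*T*h, hd, Nq, P)

        # Attention weights are normalized jointly over all frames and points.
        w = self.attention_weights(queries).view(
            B, Nq, self.n_heads, T * self.n_points).softmax(-1)
        w = w.view(B, Nq, self.n_heads, T, self.n_points)
        w = w.permute(0, 3, 2, 1, 4).reshape(B * T * self.n_heads, 1, Nq, self.n_points)

        out = (sampled * w).sum(-1)                                    # (B*T*h, hd, Nq)
        out = out.view(B, T, self.n_heads, self.head_dim, Nq).sum(1)   # aggregate frames
        out = out.permute(0, 3, 1, 2).reshape(B, Nq, C)
        return self.output_proj(out)
```

Because each query attends to only `n_frames * n_points` sampled locations per head rather than the full spatio-temporal feature volume, the cost grows linearly with clip length, which is what makes multi-frame, multi-scale processing tractable compared to dense attention.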