The goal of video segmentation is to accurately segment and track every pixel in diverse scenarios. In this paper, we present Tube-Link, a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture. Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks. To enhance the modeling of cross-tube relationships, we propose an effective way to perform tube-level linking via attention along the queries. In addition, we introduce temporal contrastive learning to obtain instance-wise discriminative features for tube-level association. Our approach offers flexibility and efficiency for both short and long video inputs, as the length of each subclip can be varied according to the needs of the dataset or scenario. Tube-Link outperforms existing specialized architectures by a significant margin on five video segmentation datasets. Specifically, it achieves nearly 13% relative improvement on VIPSeg and 4% on KITTI-STEP over the strong baseline Video K-Net. When using a ResNet50 backbone on Youtube-VIS-2019 and 2021, Tube-Link boosts IDOL by 3% and 4%, respectively. Code will be available.
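The abstract describes tube-level linking via attention along the queries. As a rough illustration of that idea, the following is a minimal PyTorch sketch in which the object queries of the current subclip cross-attend to the queries of the previous subclip. The module name `TubeQueryLinker`, the shapes, the residual design, and the single-layer structure are assumptions made for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of cross-tube query linking via attention.
# Module name, shapes, and design choices are assumptions.
import torch
import torch.nn as nn

class TubeQueryLinker(nn.Module):
    """Links object queries of the current subclip to those of the
    previous subclip with multi-head cross-attention."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur_queries: torch.Tensor, prev_queries: torch.Tensor) -> torch.Tensor:
        # cur_queries / prev_queries: (batch, num_queries, dim)
        # Each current-tube query attends to all previous-tube queries,
        # letting it absorb identity cues from the earlier subclip.
        attended, _ = self.cross_attn(cur_queries, prev_queries, prev_queries)
        return self.norm(cur_queries + attended)  # residual connection

# Example: link 100 queries of dimension 256 across two adjacent subclips.
linker = TubeQueryLinker()
prev_q = torch.randn(1, 100, 256)
cur_q = torch.randn(1, 100, 256)
linked_q = linker(cur_q, prev_q)  # (1, 100, 256)
```

Attention along the queries (rather than along dense pixel features) keeps the linking cost proportional to the small number of queries per subclip, which is one plausible reason a framework of this kind scales to both short and long video inputs.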
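The abstract also mentions temporal contrastive learning for instance-wise discriminative features. Below is a minimal sketch of a generic InfoNCE-style temporal contrastive loss over tube query embeddings, under the assumption that matched rows of the two inputs correspond to the same instance in two different subclips. The pairing scheme, temperature, and function name are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of a temporal contrastive loss over tube query
# embeddings; the pairing scheme and temperature are assumptions.
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(q_a: torch.Tensor, q_b: torch.Tensor,
                              temperature: float = 0.07) -> torch.Tensor:
    """q_a, q_b: (num_instances, dim) embeddings of the same instances
    taken from two different subclips, with row i of q_a matching row i
    of q_b. Pulls matched embeddings together, pushes others apart."""
    q_a = F.normalize(q_a, dim=-1)
    q_b = F.normalize(q_b, dim=-1)
    logits = q_a @ q_b.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(q_a.size(0), device=q_a.device)
    # Diagonal entries are positives; off-diagonal entries act as negatives.
    return F.cross_entropy(logits, targets)

# Example: 10 instances tracked across two subclips.
loss = temporal_contrastive_loss(torch.randn(10, 256), torch.randn(10, 256))
```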