Video Instance Segmentation in an Open-World

Existing video instance segmentation (VIS) approaches generally follow a closed-world assumption, where only instances of seen categories are identified and spatio-temporally segmented at inference. The open-world formulation relaxes this closed-world, static-learning assumption in two ways: (a) it distinguishes a set of known categories while labeling any unknown object as `unknown', and (b) it incrementally learns the class of an unknown object as and when the corresponding semantic labels become available. We propose the first open-world VIS approach, named OW-VISFormer, which introduces a novel feature enrichment mechanism and a spatio-temporal objectness (STO) module. The feature enrichment mechanism, based on a lightweight auxiliary network, aims to accurately delineate (unknown) objects from the background at the pixel level and to distinguish the category-specific known semantic classes. The STO module generates instance-level pseudo-labels by enhancing the foreground activations through a contrastive loss. We also introduce an extensive experimental protocol to measure the characteristics of OW-VIS. Our OW-VISFormer performs favorably against a solid baseline in the OW-VIS setting. Further, we evaluate our contributions in the standard fully-supervised VIS setting by integrating them into the recent SeqFormer, achieving an absolute gain of 1.6\% AP on the YouTube-VIS 2019 val set. Lastly, we show that our contributions generalize to the open-world detection (OWOD) setting, outperforming the best existing OWOD method in the literature. Code and models, along with the OW-VIS splits, are available at \url{https://github.com/OmkarThawakar/OWVISFormer}.
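The two-step open-world formulation described above, (a) labeling objects outside the known set as `unknown' and (b) incrementally adding classes once labels arrive, can be illustrated with a toy sketch. This is not the OW-VISFormer implementation: the class name `OpenWorldLabeler`, both thresholds, and the score dictionaries are hypothetical, and a real system would retrain the model in step (b) rather than simply extend a set.

```python
from dataclasses import dataclass, field

UNKNOWN = "unknown"


@dataclass
class OpenWorldLabeler:
    """Toy open-world labeling rule (hypothetical, for illustration only)."""
    known: set = field(default_factory=set)
    objectness_thresh: float = 0.5  # assumed value, not from the paper
    class_thresh: float = 0.5       # assumed value, not from the paper

    def label(self, objectness, class_scores):
        """Return a known class name, 'unknown', or None for background."""
        if objectness < self.objectness_thresh:
            return None  # treated as background
        # Only classes the system currently knows may be predicted by name.
        known_scores = {c: s for c, s in class_scores.items() if c in self.known}
        if known_scores:
            best = max(known_scores, key=known_scores.get)
            if known_scores[best] >= self.class_thresh:
                return best
        # Object-like, but no known class is confident enough.
        return UNKNOWN

    def learn(self, new_classes):
        """Step (b): labels for previously unknown classes become available."""
        self.known |= set(new_classes)


labeler = OpenWorldLabeler(known={"person", "dog"})
scores = {"person": 0.1, "dog": 0.2, "zebra": 0.9}  # made-up scores
before = labeler.label(0.8, scores)  # zebra is not yet a known class
labeler.learn({"zebra"})
after = labeler.label(0.8, scores)   # same instance after the class is learned
print(before, after)
```

The sketch separates objectness from class confidence, which is the property that lets an instance be segmented as an object before its category is known.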
