Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, and thus lacks the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS) that contains well-annotated objects from 1,212 diverse categories, surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Vision-Language Transformer, MindVLT, which is the first to achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of MindVLT on novel categories. We will release the dataset and code to facilitate future endeavors.