Handling long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods have limitations in addressing this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks without designing complicated architectures or requiring extra post-processing. The key contribution of GenVIS is the learning strategy, which includes a query-based training pipeline for sequential learning with a novel target label assignment. Additionally, we introduce a memory that effectively acquires information from previous states. Thanks to the new perspective, which focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in either an online or a semi-online manner. We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS). Notably, we greatly outperform the state-of-the-art on the long VIS benchmark (OVIS), improving by 5.6 AP with a ResNet-50 backbone. Code is available at https://github.com/miranheo/GenVIS.