Recently, handling long videos with complex and occluded sequences has emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods struggle to address this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a \textbf{Gen}eralized framework for \textbf{VIS}, namely \textbf{GenVIS}, that achieves state-of-the-art performance on challenging benchmarks without complicated architectures or extra post-processing. The key contribution of GenVIS is its learning strategy. Specifically, we propose a query-based training pipeline for sequential learning that uses a novel target label assignment strategy. To close the remaining gap, we introduce a memory that effectively acquires information from previous states. Thanks to this new perspective, which focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in both online and semi-online manners. We evaluate our method on popular VIS benchmarks, YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS), achieving state-of-the-art results. Notably, we greatly outperform the previous state of the art on the long-video VIS benchmark (OVIS), improving by 5.6 AP with a ResNet-50 backbone. Code will be available at https://github.com/miranheo/GenVIS.