Video Instance Segmentation is a fundamental computer vision task that involves segmenting and tracking object instances across a video sequence. Most existing methods accomplish this task with a multi-stage top-down approach: separate networks detect and segment objects in each frame, and a learned tracking head then associates these detections across consecutive frames. In this work, we instead introduce a simple, end-to-end trainable bottom-up approach that predicts instance masks at pixel-level granularity rather than relying on region proposals. Unlike contemporary frame-based models, our pipeline processes an input video clip as a single 3D volume to incorporate temporal information. The central idea of our formulation is to cast video instance segmentation as a tag assignment problem, such that assigning distinct tag values separates individual object instances across the video sequence (each tag can be an arbitrary value between 0 and 1). To this end, we propose a novel spatio-temporal tagging loss that enforces sufficient separation between different objects as well as the identification of distinct instances of the same object. Furthermore, we present a tag-based attention module that refines instance tags while concurrently learning to propagate instances within a video. Evaluations demonstrate that our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, with the lowest run-time among state-of-the-art methods.
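To make the tag assignment idea concrete, the sketch below illustrates one plausible pull/push formulation of a spatio-temporal tagging loss over a video clip, in the spirit of associative-embedding losses. This is not the paper's actual loss; the function name, the pull/push decomposition, and the margin `delta` are assumptions introduced purely for illustration.

```python
# Hypothetical sketch of a spatio-temporal pull/push tagging loss (assumed form,
# not the paper's exact definition).
import torch

def tagging_loss(tags, masks, delta=0.5):
    """
    tags:  (T, H, W) predicted per-pixel tags in [0, 1] for a clip of T frames.
    masks: (N, T, H, W) binary ground-truth masks, one per object instance,
           spanning the whole clip so a tag stays consistent over time.
    """
    means = []
    pull = tags.new_zeros(())
    for m in masks:                      # iterate over instances
        inst_tags = tags[m.bool()]       # tag values inside this instance, all frames
        if inst_tags.numel() == 0:
            continue
        mu = inst_tags.mean()
        means.append(mu)
        # pull: tags belonging to one instance should agree across space and time
        pull = pull + ((inst_tags - mu) ** 2).mean()

    push = tags.new_zeros(())
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            # push: mean tags of different instances should stay at least `delta` apart,
            # which also separates distinct instances of the same object category
            push = push + torch.relu(delta - (means[i] - means[j]).abs()) ** 2

    n = max(len(means), 1)
    return pull / n + push / max(n * (n - 1) / 2, 1)
```

Under this reading, thresholding or clustering the predicted tag values at inference time would recover the individual instance masks across the clip; the actual loss and grouping procedure are as defined in the paper.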