In this paper, we propose the Contextual Guided Segmentation (CGS) framework for video instance segmentation in three passes. In the first pass, i.e., preview segmentation, we propose Instance Re-Identification Flow to estimate the main properties of each instance (i.e., human/non-human, rigid/deformable, known/unknown category) by propagating its preview mask to other frames. In the second pass, i.e., contextual segmentation, we introduce multiple contextual segmentation schemes. For human instances, we develop skeleton-guided segmentation within a frame, along with object flow to correct and refine the results across frames. For non-human instances, if the instance has a wide variation in appearance and belongs to a known category (which can be inferred from the initial mask), we adopt instance segmentation. If the non-human instance is nearly rigid, we train FCNs on images synthesized from the first frame of the video sequence. In the final pass, i.e., guided segmentation, we develop a novel fine-grained segmentation method on non-rectangular regions of interest (ROIs). The natural-shaped ROI is generated by applying guided attention from the neighboring frames of the current one to reduce ambiguity in the segmentation of different overlapping instances. Forward mask propagation is followed by backward mask propagation to further restore instance fragments missing due to re-appearing instances, fast motion, occlusion, or heavy deformation. Finally, instances in each frame are merged based on their depth values, together with human and non-human object interaction and rare instance priority. Experiments conducted on the DAVIS Test-Challenge dataset demonstrate the effectiveness of our proposed framework. We consistently achieved 3rd place in the DAVIS Challenges 2017-2019, with 75.4%, 72.4%, and 78.4% in terms of global score, region similarity, and contour accuracy, respectively.