Existing matching-based approaches perform video object segmentation (VOS) by retrieving support features from a pixel-level memory. However, some pixels lack correspondence in the memory (i.e., they are unseen), which inevitably limits segmentation performance. In this paper, we present a Two-Stream Network (TSN). Our TSN includes (i) a pixel stream with a conventional pixel-level memory, which segments the seen pixels based on pixel-level memory retrieval; (ii) an instance stream for the unseen pixels, where a holistic understanding of the instance is obtained with dynamic segmentation heads conditioned on the features of the target instance; and (iii) a pixel division module that generates a routing map, with which the output embeddings of the two streams are fused. The compact instance stream effectively improves the segmentation accuracy of the unseen pixels, while fusing the two streams with the adaptive routing map leads to an overall performance boost. Through extensive experiments, we demonstrate the effectiveness of the proposed TSN and report state-of-the-art performance of 86.1% on YouTube-VOS 2018 and 87.5% on the DAVIS-2017 validation split.
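To make the role of the routing map concrete, the following is a minimal sketch of one way two per-pixel embeddings could be blended. The abstract does not specify the fusion rule, so the convex combination used here, the function name `fuse_streams`, and the convention that a weight of 1.0 favors the pixel stream are all illustrative assumptions rather than the method's actual implementation.

```python
# Hypothetical sketch: blend per-pixel embeddings from two streams using a routing map.
# The convex-combination rule and all names are assumptions made for illustration;
# the abstract only states that the two streams' output embeddings are fused.

def fuse_streams(pixel_emb, instance_emb, routing_map):
    """Blend two lists of per-pixel embeddings.

    pixel_emb, instance_emb: lists of N embeddings, each a list of C floats.
    routing_map: list of N weights in [0, 1]. By assumption, 1.0 means the pixel
    is treated as "seen" (trust the pixel stream) and 0.0 means "unseen"
    (trust the instance stream).
    """
    if not (len(pixel_emb) == len(instance_emb) == len(routing_map)):
        raise ValueError("all inputs must cover the same number of pixels")

    fused = []
    for p, q, r in zip(pixel_emb, instance_emb, routing_map):
        if len(p) != len(q):
            raise ValueError("embeddings must have the same number of channels")
        if not 0.0 <= r <= 1.0:
            raise ValueError("routing weights must lie in [0, 1]")
        # Per-channel convex combination of the two streams' embeddings.
        fused.append([r * a + (1.0 - r) * b for a, b in zip(p, q)])
    return fused


# Two pixels, two channels: the first is fully "seen", the second mostly "unseen".
pixel_emb = [[1.0, 0.0], [1.0, 0.0]]
instance_emb = [[0.0, 1.0], [0.0, 1.0]]
routing_map = [1.0, 0.25]
fused = fuse_streams(pixel_emb, instance_emb, routing_map)
print(fused)  # [[1.0, 0.0], [0.25, 0.75]]
```

Under this assumed rule, a pixel with a high routing weight keeps the memory-retrieved embedding, while a pixel with a low weight falls back on the instance stream's embedding.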