This paper investigates the principles of embedding learning to tackle the challenging problem of semi-supervised video object segmentation. Unlike previous practices that focus on exploring the embedding learning of the foreground object(s), we argue that the background should be equally treated. Thus, we propose a Collaborative video object segmentation by Foreground-Background Integration (CFBI) approach. CFBI separates the feature embedding into the foreground object region and its corresponding background region, implicitly promoting them to be more contrastive and improving the segmentation results accordingly. Moreover, CFBI performs both pixel-level matching processes and instance-level attention mechanisms between the reference and the predicted sequence, making CFBI robust to various object scales. Based on CFBI, we introduce a multi-scale matching structure and propose an Atrous Matching strategy, resulting in a more robust and efficient framework, CFBI+. We conduct extensive experiments on two popular benchmarks, i.e., DAVIS and YouTube-VOS. Without applying any simulated data for pre-training, our CFBI+ achieves a performance (J&F) of 82.9% and 82.8%, respectively, outperforming all other state-of-the-art methods. Code: https://github.com/z-x-yang/CFBI.
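To make the foreground-background matching idea concrete, the following is a minimal illustrative sketch, not the paper's exact formulation: each pixel embedding of the current frame is matched against reference-frame pixels, separately for the foreground and background regions, with an atrous (dilated) subsampling of the reference grid to reduce matching cost. The function name `atrous_fg_bg_matching`, the Euclidean distance metric, and the stride-based subsampling are assumptions made here for illustration.

```python
import numpy as np

def atrous_fg_bg_matching(ref_emb, ref_mask, cur_emb, dilation=2):
    """Illustrative sketch of foreground-background pixel matching.

    ref_emb:  (H, W, C) reference-frame pixel embeddings
    ref_mask: (H, W)    reference-frame foreground mask (1 = foreground)
    cur_emb:  (H, W, C) current-frame pixel embeddings
    dilation: atrous stride used to subsample the reference grid
    Returns a (H, W) score map; positive values mean the pixel is closer
    to the foreground region than to the background region.
    """
    # Atrous (dilated) subsampling: keep every `dilation`-th reference
    # pixel, shrinking the matching cost by roughly dilation**2.
    sub_emb = ref_emb[::dilation, ::dilation]    # (h', w', C)
    sub_mask = ref_mask[::dilation, ::dilation]  # (h', w')

    H, W, C = cur_emb.shape
    cur = cur_emb.reshape(-1, 1, C)              # (H*W, 1, C)
    ref = sub_emb.reshape(1, -1, C)              # (1, h'*w', C)
    dist = np.linalg.norm(cur - ref, axis=-1)    # (H*W, h'*w')

    fg = sub_mask.reshape(-1).astype(bool)
    # Distance from each current pixel to its nearest foreground and
    # nearest background reference pixel.
    fg_dist = dist[:, fg].min(axis=1).reshape(H, W)
    bg_dist = dist[:, ~fg].min(axis=1).reshape(H, W)
    # Contrastive score: larger when the pixel resembles the foreground.
    return bg_dist - fg_dist
```

Treating background symmetrically to foreground, as above, is what implicitly pushes the two embeddings apart: a pixel is labeled by which region it is closer to, so both regions shape the decision boundary.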