We present a refinement framework to boost the performance of pre-trained semi-supervised video object segmentation (VOS) models. Our work builds on scale inconsistency, motivated by the observation that existing VOS models produce inconsistent predictions when the same frame is fed at different input sizes. We use this scale inconsistency as a cue to devise a pixel-level attention module that aggregates the complementary strengths of the predictions from different-sized inputs. The scale inconsistency is also used to regularize training through a pixel-level variance that serves as an uncertainty estimate. We further present a self-supervised online adaptation, tailored for test-time optimization, that bootstraps the predictions using the scale inconsistency in place of ground-truth masks. Experiments on the DAVIS 16 and DAVIS 17 datasets show that our framework can be generically applied to various VOS models and improves their performance.
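To make the idea concrete, below is a minimal PyTorch sketch, not the paper's implementation. It assumes a hypothetical `model` that maps a frame tensor to mask logits; the scale set, the sigmoid probabilities, and the variance-based fusion weighting are illustrative stand-ins for the learned pixel-level attention module described in the abstract. The averaged per-pixel variance doubles as a self-supervised consistency objective of the kind usable for test-time adaptation.

```python
import torch
import torch.nn.functional as F

def multiscale_predict(model, frame, scales=(0.75, 1.0, 1.25)):
    """Run a (hypothetical) VOS model on rescaled copies of `frame`
    (B, 3, H, W) and return per-scale mask logits resampled to the
    original resolution, stacked as (S, B, C, H, W)."""
    h, w = frame.shape[-2:]
    logits = []
    for s in scales:
        x = F.interpolate(frame, scale_factor=s, mode="bilinear",
                          align_corners=False)
        y = model(x)  # assumed to return mask logits for the resized input
        logits.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                    align_corners=False))
    return torch.stack(logits, dim=0)

def scale_inconsistency(logits):
    """Pixel-level variance of the per-scale probabilities: a simple
    uncertainty estimate in the spirit of the paper's regularizer."""
    return torch.sigmoid(logits).var(dim=0)  # (B, C, H, W)

def fuse_predictions(logits):
    """Variance-weighted fusion across scales: per pixel, down-weight
    scales whose prediction deviates most from the mean. A heuristic
    stand-in for the learned pixel-level attention module."""
    probs = torch.sigmoid(logits)
    mean = probs.mean(dim=0, keepdim=True)
    weights = torch.softmax(-(probs - mean).pow(2), dim=0)
    return (weights * probs).sum(dim=0)  # (B, C, H, W)

def consistency_loss(logits):
    """Self-supervised objective for online (test-time) adaptation:
    penalize disagreement between scales, so the model can be tuned
    on test frames without ground-truth masks."""
    return scale_inconsistency(logits).mean()
```

Under these assumptions, a test-time loop would call `multiscale_predict` on each frame, backpropagate `consistency_loss` to adapt the model, and take `fuse_predictions` as the refined output mask.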