Recently, methods based on the Space-Time Memory Network (STM) have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A crucial problem in this task is how to model dependencies both among different frames and within each frame. However, most STM-based methods neglect the spatial relationships (within each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, which introduces a vision transformer to fully exploit and model both the temporal and spatial relationships. Moreover, most STM-based approaches employ two separate encoders to extract features from the two key inputs, i.e., the reference set (history frames with their predicted masks) and the query frame (the current frame), which increases the parameter count and complexity of the model. To slim the popular two-encoder pipeline while preserving its effectiveness, we design a single two-path feature extractor that encodes both inputs in a unified way. Extensive experiments demonstrate the superiority of TransVOS over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets.
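To make the idea of modeling temporal and spatial relationships jointly more concrete, the following is a minimal illustrative sketch and not the TransVOS implementation. It assumes that per-frame feature maps from the reference and query frames are flattened into a single token sequence, and that plain single-head self-attention with identity projections is applied over it. The helper names `flatten_frames` and `self_attention` are hypothetical and chosen only for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def flatten_frames(frames):
    """Flatten T frames, each an H x W grid of C-dim feature vectors,
    into one sequence of T*H*W tokens (hypothetical helper)."""
    return [vec for frame in frames for row in frame for vec in row]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections), which keeps the sketch dependency-free.
    Returns the mixed tokens and the attention weight matrix.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for q in tokens:
        # Every token is scored against every other token, regardless of
        # whether it lies in the same frame (spatial) or another (temporal).
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        w = softmax(scores)
        mixed = [sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                 for c in range(dim)]
        outputs.append(mixed)
        weights.append(w)
    return outputs, weights


# Toy data: one reference frame and one query frame, each 1 x 2 with C = 2.
reference = [[[1.0, 0.0], [0.0, 1.0]]]
query = [[[1.0, 1.0], [0.5, 0.5]]]

tokens = flatten_frames([reference, query])
mixed, attn = self_attention(tokens)
print(len(tokens), "tokens; first attention row:", attn[0])
```

Because all tokens share one sequence, each row of the attention matrix spans positions in both the reference and the query frame, which is one simple way to let a single attention operation cover both kinds of relationship. A practical model would add learned projections, multiple heads, and positional encodings.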