Recently, Space-Time Memory Network (STM) based methods have achieved state-of-the-art performance in semi-supervised video object segmentation (VOS). A critical problem in this task is how to model the dependencies both among different frames and within each frame. However, most of these methods neglect the spatial relationships (within each frame) and do not make full use of the temporal relationships (among different frames). In this paper, we propose a new transformer-based framework, termed TransVOS, which introduces a vision transformer to fully exploit and model both the temporal and spatial relationships. Moreover, most STM-based approaches employ two separate encoders to extract features from the two significant inputs, i.e., the reference set (history frames with predicted masks) and the query frame, which increases the models' parameters and complexity. To slim the popular two-encoder pipeline while keeping its effectiveness, we design a single two-path feature extractor that encodes the above two inputs in a unified way. Extensive experiments demonstrate the superiority of our TransVOS over state-of-the-art methods on both the DAVIS and YouTube-VOS datasets. Code will be released upon publication.
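The abstract only sketches the architecture in words; the snippet below is a minimal, illustrative PyTorch sketch (not the authors' released code) of how such a pipeline could be wired: a single backbone with two input stems (a 4-channel path for reference frames concatenated with their masks, a 3-channel path for the query frame), whose flattened spatio-temporal features are jointly processed by a transformer encoder. All module names, layer counts, and hyper-parameters here are assumptions for illustration only.

```python
# Minimal sketch of a two-path feature extractor + transformer over
# joint space-time tokens, as described in the abstract. Illustrative only.
import torch
import torch.nn as nn


class TwoPathExtractor(nn.Module):
    """One shared trunk with two input stems: a 4-channel stem for reference
    frames concatenated with their predicted masks, and a 3-channel stem for
    the query frame."""

    def __init__(self, dim=256):
        super().__init__()
        self.ref_stem = nn.Conv2d(4, dim, kernel_size=4, stride=4)    # RGB + mask
        self.query_stem = nn.Conv2d(3, dim, kernel_size=4, stride=4)  # RGB only
        self.shared = nn.Sequential(  # stands in for a ResNet-like shared body
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x, with_mask):
        stem = self.ref_stem if with_mask else self.query_stem
        return self.shared(stem(x))  # (B, dim, H/8, W/8)


class TransVOSSketch(nn.Module):
    def __init__(self, dim=256, heads=8, layers=6):
        super().__init__()
        self.extractor = TwoPathExtractor(dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, ref_frames, ref_masks, query_frame):
        # ref_frames: (B, T, 3, H, W), ref_masks: (B, T, 1, H, W), query_frame: (B, 3, H, W)
        B, T = ref_frames.shape[:2]
        refs = torch.cat([ref_frames, ref_masks], dim=2).flatten(0, 1)  # (B*T, 4, H, W)
        ref_feat = self.extractor(refs, with_mask=True)                 # (B*T, C, h, w)
        qry_feat = self.extractor(query_frame, with_mask=False)         # (B, C, h, w)

        C, h, w = ref_feat.shape[1:]
        ref_tokens = ref_feat.view(B, T, C, h * w).permute(0, 1, 3, 2).reshape(B, T * h * w, C)
        qry_tokens = qry_feat.flatten(2).permute(0, 2, 1)               # (B, h*w, C)

        # The transformer attends over all reference and query tokens at once,
        # modeling both temporal (cross-frame) and spatial (within-frame) relations.
        tokens = torch.cat([ref_tokens, qry_tokens], dim=1)
        tokens = self.transformer(tokens)
        # Return the query-frame features, which would feed a mask-prediction head.
        return tokens[:, -h * w:].permute(0, 2, 1).reshape(B, C, h, w)
```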