The objective of this paper is a model that can discover, track, and segment multiple moving objects in a video. We make four contributions: First, we introduce an object-centric segmentation model with a depth-ordered layer representation. It is implemented using a variant of the transformer architecture that ingests optical flow, where each query vector specifies an object and its layer for the entire video. The model can effectively discover multiple moving objects and handle mutual occlusions; Second, we introduce a scalable pipeline for generating synthetic training data with multiple objects, significantly reducing the need for labour-intensive annotations and supporting Sim2Real generalisation; Third, we show that the model learns object permanence and temporal shape consistency, and can predict amodal segmentation masks; Fourth, we evaluate the model on standard video segmentation benchmarks (DAVIS, MoCA, SegTrack, FBMS-59) and achieve state-of-the-art unsupervised segmentation performance, even outperforming several supervised approaches. With test-time adaptation, we observe further performance gains.
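To make the query-based layered design concrete, below is a minimal PyTorch sketch of the kind of architecture the abstract describes: learned query vectors, one per object/layer slot, attend over optical-flow features from all frames of a video and decode into per-object amodal masks. This is an illustrative assumption, not the authors' implementation; all names (LayeredFlowSegmenter, flow_encoder, mask_head), dimensions, and the patch-encoding and mask-decoding choices are hypothetical.

```python
# Illustrative sketch only: a query-based transformer over optical flow,
# producing one depth-ordered mask per query. Hyperparameters and module
# structure are assumptions, not the paper's actual architecture.
import torch
import torch.nn as nn

class LayeredFlowSegmenter(nn.Module):
    def __init__(self, num_queries=3, dim=256, heads=8, dec_layers=6):
        super().__init__()
        # Encode 2-channel optical flow frames into patch features.
        self.flow_encoder = nn.Conv2d(2, dim, kernel_size=16, stride=16)
        # One learned query per object/layer slot, shared across the video,
        # so a query tracks the same object (and layer) in every frame.
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=dec_layers)
        self.mask_head = nn.Linear(dim, dim)  # query -> mask embedding

    def forward(self, flow):  # flow: (B, T, 2, H, W)
        B, T = flow.shape[:2]
        feats = self.flow_encoder(flow.flatten(0, 1))   # (B*T, C, h, w)
        _, C, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)       # (B*T, h*w, C)
        tokens = tokens.reshape(B, T * h * w, C)        # attend over all frames
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        q = self.decoder(q, tokens)                     # (B, Q, C)
        # Dot product between mask embeddings and per-frame features yields
        # one (amodal) mask per query per frame.
        emb = self.mask_head(q)                         # (B, Q, C)
        feats = feats.reshape(B, T, C, h, w)
        masks = torch.einsum('bqc,btchw->bqthw', emb, feats)
        return masks.sigmoid()  # (B, Q, T, h, w); query index = layer order
```

Under this reading, occlusion handling follows from the layer ordering: at inference the per-query masks can be composited front-to-back, so a frontmost layer explains away the flow of the objects it occludes while each occluded object still has a full amodal mask of its own.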