We introduce a method to segment the visual field into independently moving regions, trained with no ground truth or supervision. It consists of an adversarial conditional encoder-decoder architecture based on Slot Attention, modified to use the image as context to decode optical flow without attempting to reconstruct the image itself. In the resulting multi-modal representation, one modality (flow) feeds the encoder to produce separate latent codes (slots), whereas the other modality (image) conditions the decoder to generate the first (flow) from the slots. This design frees the representation from having to encode complex nuisance variability in the image due to, for instance, illumination and reflectance properties of the scene. Since customary autoencoding based on minimizing the reconstruction error does not preclude the entire flow from being encoded into a single slot, we modify the loss to an adversarial criterion based on Contextual Information Separation. The resulting min-max optimization fosters the separation of objects and their assignment to different attention slots, leading to Divided Attention, or DivA. DivA outperforms recent unsupervised multi-object motion segmentation methods while tripling run-time speed, reaching up to 104 FPS, and reducing the performance gap relative to supervised methods to 12% or less. DivA can handle different numbers of objects and different image sizes at training and test time, is invariant to permutations of object labels, and does not require explicit regularization.
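To make the architecture concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the module names (SlotAttention, ConditionalDecoder, reconstruct_flow), the omitted feature extractors (random tensors stand in), the tensor shapes, and all hyperparameters are illustrative assumptions, and the adversarial Contextual Information Separation criterion is only indicated by a placeholder comment.

```python
# Minimal sketch of a conditional slot-attention encoder-decoder:
# flow features are encoded into slots, and the image conditions the
# decoder that reconstructs the flow from each slot. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SlotAttention(nn.Module):
    """Iterative slot attention over flow features (Locatello et al. style)."""
    def __init__(self, num_slots, dim, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_sigma = nn.Parameter(torch.ones(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, feats):                        # feats: (B, N, D) flow features
        B, _, D = feats.shape
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.slots_mu + self.slots_sigma * torch.randn(
            B, self.num_slots, D, device=feats.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for input locations.
            attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)     # weighted mean
            updates = torch.einsum('bkn,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, D),
                             slots.reshape(-1, D)).view(B, self.num_slots, D)
        return slots

class ConditionalDecoder(nn.Module):
    """Decodes flow from a single slot, with image features as spatial context."""
    def __init__(self, slot_dim, img_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(slot_dim + img_dim, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 1))                 # 2 flow channels + 1 alpha logit

    def forward(self, slot, img_feats):              # slot: (B, D), img_feats: (B, C, H, W)
        B, _, H, W = img_feats.shape
        tiled = slot[:, :, None, None].expand(-1, -1, H, W)
        out = self.net(torch.cat([tiled, img_feats], dim=1))
        return out[:, :2], out[:, 2:]                # per-slot flow, alpha logit

def reconstruct_flow(slots, img_feats, decoder):
    """Composite the per-slot flows using masks that compete per pixel."""
    flows, alphas = zip(*[decoder(slots[:, k], img_feats)
                          for k in range(slots.shape[1])])
    masks = F.softmax(torch.stack(alphas, dim=1), dim=1)    # (B, K, 1, H, W)
    flow_hat = (torch.stack(flows, dim=1) * masks).sum(dim=1)
    return flow_hat, masks

# Usage sketch: CNN feature extractors (not shown) would produce the flow
# features (B, N, D) and image features (B, C, H, W); random tensors stand in.
B, N, D, C, H, W, K = 2, 256, 64, 32, 16, 16, 4
slot_attn = SlotAttention(num_slots=K, dim=D)
decoder = ConditionalDecoder(slot_dim=D, img_dim=C)
slots = slot_attn(torch.randn(B, N, D))
flow_hat, masks = reconstruct_flow(slots, torch.randn(B, C, H, W), decoder)
# A plain reconstruction loss, e.g. F.mse_loss(flow_hat, flow_gt), does not
# preclude one slot from encoding the entire flow; the paper instead trains
# with an adversarial Contextual Information Separation criterion (not shown).
```

Note the design choice in reconstruct_flow: the alpha masks are softmaxed across slots, so slots compete for each pixel of the flow, mirroring the separation that the adversarial criterion is meant to enforce.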