Animals have evolved highly functional visual systems to understand motion, assisting perception even in complex environments. In this paper, we work towards developing a computer vision system able to segment objects by exploiting motion cues, i.e. motion segmentation. We make the following contributions: First, we introduce a simple variant of the Transformer to segment optical flow frames into primary objects and the background. Second, we train the architecture in a self-supervised manner, i.e. without using any manual annotations. Third, we analyze several critical components of our method and conduct thorough ablation studies to validate their necessity. Fourth, we evaluate the proposed architecture on public benchmarks (DAVIS2016, SegTrackv2, and FBMS59). Despite using only optical flow as input, our approach achieves superior or comparable results to previous state-of-the-art self-supervised methods, while being an order of magnitude faster. We additionally evaluate on a challenging camouflage dataset (MoCA), significantly outperforming other self-supervised approaches and comparing favourably to the top supervised approach, highlighting the importance of motion cues and the potential bias towards visual appearance in existing video segmentation models.