Despite significant efforts, cutting-edge video segmentation methods remain sensitive to occlusion and rapid movement, due to their reliance on object appearance in the form of object embeddings, which are vulnerable to these disturbances. A common remedy is to use optical flow to provide motion information, but optical flow essentially considers only pixel-level motion, which still relies on appearance similarity and is therefore often inaccurate under occlusion and fast movement. In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In contrast to pixel-wise motion, InstMove relies mainly on instance-level motion information that is free from image feature embeddings and carries a physical interpretation, making it more accurate and robust to occlusion and fast-moving objects. To better fit video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns a dynamic model through a memory network to predict the object's position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boosts their performance. Specifically, we improve the previous arts by 1.5 AP on the OVIS dataset, which features heavy occlusions, and by 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution for object-centric video segmentation in complex scenarios.
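To illustrate the core idea of instance-level motion, the following is a minimal, non-learned sketch: instead of matching pixel appearance, it extracts a physical quantity (the mask centroid) from two past instance masks and extrapolates the object's position one frame ahead under a constant-velocity assumption. This is only an analogue of what InstMove does — the actual method learns a dynamic model with a memory network and also predicts shape deformation; the function names here are illustrative, not from the paper's code.

```python
import numpy as np

def centroid(mask):
    """Center of mass (row, col) of a binary instance mask."""
    ys, xs = np.nonzero(mask)
    return np.array([ys.mean(), xs.mean()])

def predict_next_mask(prev_mask, curr_mask):
    """Translate the current mask by the instance-level velocity
    (constant-velocity assumption). Uses no image features, so it is
    unaffected by appearance changes from occlusion or motion blur."""
    velocity = centroid(curr_mask) - centroid(prev_mask)
    dy, dx = np.round(velocity).astype(int)
    pred = np.roll(curr_mask, shift=(dy, dx), axis=(0, 1))
    # np.roll wraps around the borders; zero out the wrapped region
    if dy > 0:
        pred[:dy] = 0
    elif dy < 0:
        pred[dy:] = 0
    if dx > 0:
        pred[:, :dx] = 0
    elif dx < 0:
        pred[:, dx:] = 0
    return pred

def square_mask(x0, size=50):
    """Toy instance: a 10x10 square at column offset x0."""
    m = np.zeros((size, size), dtype=np.uint8)
    m[20:30, x0:x0 + 10] = 1
    return m

# object moving 3 px to the right per frame
m0, m1 = square_mask(5), square_mask(8)
m2_pred = predict_next_mask(m0, m1)  # square predicted at x0 = 11
```

In InstMove this prediction serves as a motion prior that is fused into existing segmentation models, which is why the integration needs only a few lines of code.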