Amodal perception requires inferring the full shape of an object that is partially occluded. This task is challenging on two levels: (1) it requires more information than is contained in a single retinal image or sensor frame, and (2) it is difficult to obtain enough well-annotated amodal labels for supervision. To address these challenges, this paper develops a new framework for Self-supervised amodal Video object segmentation (SaVos). Our method efficiently leverages the visual information in video temporal sequences to infer the amodal masks of objects. The key intuition is that the occluded part of an object can be explained away if that part is visible in other frames, even in deformed form, as long as the deformation can be reasonably learned. Accordingly, we derive a novel self-supervised learning paradigm that efficiently utilizes the visible object parts as supervision to guide training on videos. In addition to learning a type prior for completing the masks of known object types, SaVos also learns a spatiotemporal prior, which is likewise useful for the amodal task and can generalize to unseen types. The proposed framework achieves state-of-the-art performance on the synthetic amodal segmentation benchmark FISHBOWL and the real-world benchmark KINS-Video-Car. Further, it lends itself well to being transferred to novel distributions using test-time adaptation, outperforming existing models even after the transfer to a new distribution.
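To make the training signal concrete, the sketch below illustrates one way visible (modal) masks in neighboring frames can supervise an amodal prediction, in the spirit of the paradigm described above. This is a hypothetical simplification, not the paper's actual objective: the function name, the two-term loss, and the assumption that the frame-t prediction has already been warped to frame t+1 (e.g. by estimated motion) are all illustrative.

```python
import torch
import torch.nn.functional as F

def amodal_self_supervision_loss(amodal_logits_t, visible_mask_t,
                                 warped_amodal_logits_t, visible_mask_t1):
    """Hypothetical sketch of a SaVos-style self-supervised signal.

    amodal_logits_t:        predicted amodal mask logits at frame t, shape (B, 1, H, W)
    visible_mask_t:         visible (modal) mask at frame t, same shape, values in {0, 1}
    warped_amodal_logits_t: the frame-t amodal prediction warped into frame t+1
                            (assumed to come from estimated motion/deformation)
    visible_mask_t1:        visible mask at frame t+1
    """
    ones = torch.ones_like(visible_mask_t)

    # (1) The amodal mask must cover everything visible at frame t.
    bce_t = F.binary_cross_entropy_with_logits(
        amodal_logits_t, ones, reduction="none")
    loss_t = (bce_t * visible_mask_t).sum() / visible_mask_t.sum().clamp(min=1.0)

    # (2) Parts occluded at frame t but visible at frame t+1 "explain away"
    # the occlusion: the warped amodal prediction must cover them as well.
    bce_t1 = F.binary_cross_entropy_with_logits(
        warped_amodal_logits_t, ones, reduction="none")
    loss_t1 = (bce_t1 * visible_mask_t1).sum() / visible_mask_t1.sum().clamp(min=1.0)

    return loss_t + loss_t1
```

Note that both terms penalize the prediction only inside visible regions, so no amodal ground truth is ever needed; the occluded shape is constrained indirectly, through frames where it becomes visible.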