Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects---shadows, reflections, generated smoke, etc.---are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and a rough segmentation mask over time of one or more subjects of interest, we estimate an omnimatte for each subject---an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic---it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections, to fully opaque effects such as objects attached to the subject.