Computer vision is increasingly effective at segmenting objects in images and videos; however, scene effects related to the objects -- shadows, reflections, generated smoke, etc. -- are typically overlooked. Identifying such scene effects and associating them with the objects producing them is important for improving our fundamental understanding of visual scenes, and can also assist a variety of applications such as removing, duplicating, or enhancing objects in video. In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video. Given an ordinary video and rough segmentation masks over time of one or more subjects of interest, we estimate an omnimatte for each subject -- an alpha matte and color image that includes the subject along with all its related time-varying scene elements. Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic -- it produces omnimattes automatically for arbitrary objects and a variety of effects. We show results on real-world videos containing interactions between different types of subjects (cars, animals, people) and complex effects, ranging from semi-transparent elements such as smoke and reflections to fully opaque effects such as objects attached to the subject.
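To make the self-supervised setup concrete, the following is a minimal sketch of the core objective implied by the abstract: each subject's omnimatte is an RGBA layer (a color image plus an alpha matte), the layers are composited back-to-front over a background, and the only training target is the input frame itself. All names here (composite_over, reconstruction_loss, the tensor shapes) are illustrative assumptions, not the paper's actual code or architecture.

```python
import torch

def composite_over(layers, background):
    """Standard back-to-front 'over' compositing of RGBA layers.

    layers:     list of (color, alpha) pairs, color [3,H,W], alpha [1,H,W],
                ordered back to front.
    background: [3,H,W] color tensor.
    """
    out = background
    for color, alpha in layers:
        out = alpha * color + (1.0 - alpha) * out
    return out

def reconstruction_loss(layers, background, frame):
    # Self-supervised: the composited layers must reproduce the input frame,
    # so any effect an object produces (shadow, reflection, smoke) must end
    # up in some layer for the reconstruction to succeed.
    recon = composite_over(layers, background)
    return torch.mean((recon - frame) ** 2)

# Hypothetical usage for a single frame with two subjects:
H, W = 128, 224
frame = torch.rand(3, H, W)                      # input video frame
background = torch.rand(3, H, W, requires_grad=True)
layers = [(torch.rand(3, H, W, requires_grad=True),   # per-subject color
           torch.rand(1, H, W, requires_grad=True))   # per-subject alpha
          for _ in range(2)]
loss = reconstruction_loss(layers, background, frame)
loss.backward()                                  # gradients reach every layer
```

In the paper the layers are predicted by a network conditioned on the rough input masks rather than optimized directly as free tensors; the sketch only shows why reconstructing the video from per-subject layers forces each subject's time-varying effects into that subject's omnimatte.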