Given a raw video sequence taken from a freely-moving camera, we study the problem of decomposing the observed 3D scene into a static background and a dynamic foreground containing the objects that move in the video sequence. This task is reminiscent of the classic background subtraction problem, but is significantly harder because all parts of the scene, static and dynamic, generate a large apparent motion due to the camera's large viewpoint changes. In particular, we consider egocentric videos and further separate the dynamic component into objects and the actor that observes and moves them. We achieve this factorization by reconstructing the video via a triple-stream neural rendering network that explains the different motions based on corresponding inductive biases. We demonstrate that our method can successfully separate the different types of motion, outperforming recent neural rendering baselines at this task, and that it can accurately segment moving objects. We assess the method empirically on challenging videos from the EPIC-KITCHENS dataset, which we augment with appropriate annotations to create a new benchmark for the task of dynamic object segmentation on unconstrained video sequences in complex 3D environments.
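To make the triple-stream idea concrete, below is a minimal PyTorch sketch of one plausible realization; it is our own illustration, not the paper's actual implementation. It uses three NeRF-style MLP streams, one per motion type, where the static stream sees only the encoded 3D sample position and the dynamic and actor streams are additionally conditioned on an encoded frame time so they can explain time-varying content. The class name `TripleStreamRenderer`, the layer sizes, the encoding dimensions, and the density-weighted compositing rule are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class TripleStreamRenderer(nn.Module):
    """Hypothetical sketch: one radiance-field MLP per motion type
    (static background, dynamic objects, actor), composited per sample."""

    def __init__(self, pos_dim=63, time_dim=17, hidden=256):
        super().__init__()
        # Static stream: position only -- the background is time-invariant.
        self.static = self._mlp(pos_dim, hidden)
        # Dynamic and actor streams: also see time, so they can move.
        self.dynamic = self._mlp(pos_dim + time_dim, hidden)
        self.actor = self._mlp(pos_dim + time_dim, hidden)

    @staticmethod
    def _mlp(in_dim, hidden):
        # Each stream outputs density (1 value) + RGB (3 values) per sample.
        return nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),
        )

    def forward(self, x, t):
        # x: (N, pos_dim) encoded sample positions; t: (N, time_dim) encoded time.
        out_s = self.static(x)
        out_d = self.dynamic(torch.cat([x, t], dim=-1))
        out_a = self.actor(torch.cat([x, t], dim=-1))
        # Non-negative densities; colours squashed to [0, 1].
        sigmas = torch.stack(
            [out_s[..., 0], out_d[..., 0], out_a[..., 0]], dim=-1).relu()
        rgbs = torch.stack(
            [out_s[..., 1:], out_d[..., 1:], out_a[..., 1:]], dim=-2).sigmoid()
        # Composite: total density is the sum of the streams; colour is the
        # density-weighted mix, so each stream only explains what it renders.
        sigma = sigmas.sum(dim=-1, keepdim=True)
        rgb = (sigmas.unsqueeze(-1) * rgbs).sum(dim=-2) / sigma.clamp(min=1e-8)
        return sigma, rgb
```

The inductive bias encoded in this sketch is that the static stream has no access to time, so any content that changes across frames must be absorbed by the dynamic or actor streams; distinguishing those two would additionally require, for example, expressing the actor stream in a camera-attached coordinate frame, since the actor moves rigidly with the wearer.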