While stochastic video prediction models enable future prediction under uncertainty, they mostly fail to model the complex dynamics of real-world scenes. For example, they cannot provide reliable predictions for driving scenarios, where the camera moves and foreground objects move independently. Existing methods focus only on changes in pixels and therefore fail to fully capture the dynamics of the structured world. In this paper, we assume that there is an underlying process creating the observations in a video and propose to factorize it into static and dynamic components. We model the static part based on the scene structure and the ego-motion of the vehicle, and the dynamic part based on the residual motion of the dynamic objects. By learning separate distributions over changes in the foreground and the background, we can decompose the scene into static and dynamic parts and model the change in each separately. Our experiments demonstrate that disentangling structure and motion helps stochastic video prediction, leading to better future predictions in complex driving scenarios on two real-world driving datasets, KITTI and Cityscapes.
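To make the factorization concrete, below is a minimal sketch, assuming a flow-based formulation in PyTorch: the static component is obtained by warping the current frame with the flow induced by scene structure and ego-motion, the dynamic component adds a residual flow for independently moving objects, and the two are combined with a learned per-pixel mask. The function names (`warp`, `predict_next_frame`) and the mask-based combination are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not the authors' implementation) of factorizing a predicted
# frame into a static part (ego-motion warping) and a dynamic residual.
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,3,H,W) with optical flow `flow` (B,2,H,W)."""
    b, _, h, w = frame.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)      # (2,H,W)
    coords = grid.unsqueeze(0) + flow                           # (B,2,H,W)
    # Normalize to [-1, 1] and reorder to (B,H,W,2) for grid_sample.
    coords = torch.stack(
        (2.0 * coords[:, 0] / (w - 1) - 1.0,
         2.0 * coords[:, 1] / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(frame, coords, align_corners=True)

def predict_next_frame(frame_t, static_flow, dynamic_flow, mask):
    """Combine static (ego-motion) and dynamic (object-motion) predictions.

    static_flow  : flow induced by scene structure and ego-motion
    dynamic_flow : residual flow of independently moving objects
    mask         : per-pixel probability of belonging to the dynamic part
    """
    static_pred = warp(frame_t, static_flow)
    dynamic_pred = warp(frame_t, static_flow + dynamic_flow)
    return mask * dynamic_pred + (1.0 - mask) * static_pred
```

In a stochastic setting, `static_flow` and `dynamic_flow` would each be sampled from their own learned latent distribution, which is what allows the change in the background and the foreground to be modeled separately.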