Video prediction, forecasting future frames from a sequence of input frames, is a challenging task since the changes between frames are influenced by various factors, such as the global context of the surrounding scene and the local motion dynamics of objects. In this paper, we propose a new framework that integrates these complementary attributes to predict complex pixel dynamics through deep networks. We present global context propagation networks that iteratively aggregate non-local neighboring representations to preserve contextual information across the past frames. To capture the local motion patterns of objects, we also devise local filter memory networks that generate adaptive filter kernels by storing the prototypical motion of moving objects in memory. By utilizing the outputs of both networks, the proposed framework can address blurry prediction and color distortion. We conduct experiments on the Caltech pedestrian and UCF101 datasets and demonstrate state-of-the-art results. In particular, for multi-step prediction, we obtain outstanding performance in both quantitative and qualitative evaluations.
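The abstract leaves both architectures unspecified, so the following sketches are reading aids only, not the paper's actual networks. First, a minimal PyTorch sketch of what "iteratively aggregating non-local neighboring representations" could look like: every spatial position attends to all others, and stacking several such blocks propagates context across the feature map. The class name `NonLocalContextBlock`, the dot-product attention form, and all hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class NonLocalContextBlock(nn.Module):
    """Hypothetical non-local block: each spatial position attends to all
    others, so scene context can propagate across the whole feature map."""
    def __init__(self, channels, reduction=2):
        super().__init__()
        inner = channels // reduction
        self.query = nn.Conv2d(channels, inner, 1)
        self.key = nn.Conv2d(channels, inner, 1)
        self.value = nn.Conv2d(channels, inner, 1)
        self.out = nn.Conv2d(inner, channels, 1)

    def forward(self, x):                                  # x: (B, C, H, W)
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        k = self.key(x).flatten(2)                         # (B, C', HW)
        v = self.value(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        attn = torch.softmax(q @ k / (k.shape[1] ** 0.5), dim=-1)  # (B, HW, HW)
        ctx = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)      # (B, C', H, W)
        return x + self.out(ctx)   # residual: add propagated context to input

# "Iterative" aggregation could then be a stack of such blocks applied to
# the encoded features of the past frames (depth and width are assumptions):
propagate = nn.Sequential(*[NonLocalContextBlock(64) for _ in range(3)])
```

Likewise, a hedged sketch of how a memory of motion prototypes might generate adaptive filter kernels. `LocalFilterMemory`, `apply_adaptive_filter`, the slot count, and the per-sample grouped-convolution trick are all hypothetical choices under the assumption that one kernel is generated per sample from a pooled motion descriptor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalFilterMemory(nn.Module):
    """Hypothetical memory of motion prototypes, read by attention to
    produce one adaptive filter kernel per sample."""
    def __init__(self, channels, num_slots=64, kernel_size=3):
        super().__init__()
        self.k = kernel_size
        # Learnable slots intended to store prototypical motion patterns.
        self.memory = nn.Parameter(torch.randn(num_slots, channels) * 0.02)
        self.to_kernel = nn.Linear(channels, kernel_size * kernel_size)

    def forward(self, motion_feat):
        # motion_feat: (B, C) pooled descriptor, e.g. encoded frame differences.
        attn = torch.softmax(motion_feat @ self.memory.t(), dim=-1)  # (B, M)
        read = attn @ self.memory                                    # (B, C)
        kernels = torch.softmax(self.to_kernel(read), dim=-1)        # normalized
        return kernels.view(-1, 1, self.k, self.k)                   # (B, 1, k, k)

def apply_adaptive_filter(frames, kernels):
    """Convolves each sample with its own generated kernel by folding the
    batch into the channel dimension of a grouped convolution."""
    b, c, h, w = frames.shape
    k = kernels.shape[-1]
    weight = kernels.expand(b, c, k, k).reshape(b * c, 1, k, k)
    out = F.conv2d(frames.reshape(1, b * c, h, w), weight,
                   padding=k // 2, groups=b * c)
    return out.view(b, c, h, w)
```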