Video prediction is a challenging task with broad application prospects in meteorology and robotic systems. Existing works fail to balance short-term and long-term prediction performance and to extract robust latent dynamics from video frames. We propose a two-branch sequence-to-sequence deep model that disentangles the Taylor feature and the residual feature in video frames via a novel recurrent prediction module (TaylorCell) and a residual module. TaylorCell expands the high-dimensional features of video frames into a finite Taylor series to describe the latent dynamics. Within TaylorCell, we propose the Taylor prediction unit (TPU) and the memory correction unit (MCU). TPU employs the derivative information of the first input frame to predict future frames, avoiding error accumulation. MCU distills the information of all past frames to correct the Taylor feature predicted by TPU. Correspondingly, the residual module extracts the residual feature complementary to the Taylor feature. On three benchmark datasets (Moving MNIST, TaxiBJ, Human 3.6), our model outperforms or matches state-of-the-art models, and ablation experiments demonstrate its effectiveness in long-term prediction.
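As a rough illustration of the idea behind TaylorCell (the notation below is ours, not the paper's): given a latent feature $F_0$ extracted from the first input frame and estimates of its temporal derivatives $F_0^{(1)}, \dots, F_0^{(K)}$, the Taylor feature at a future time step $t$ could be approximated by a finite expansion of order $K$,

\[
\hat{F}_t \;\approx\; \sum_{k=0}^{K} \frac{F_0^{(k)}}{k!}\, t^{k},
\]

which depends only on derivative information from the first frame and therefore does not accumulate error across recurrent steps; MCU would then correct $\hat{F}_t$ using information distilled from all past frames.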