The predictive learning of spatiotemporal sequences aims to generate future images by learning from the historical context, where the visual dynamics are believed to have modular structures that can be learned with compositional subsystems. This paper models these structures with PredRNN, a new recurrent network in which a pair of memory cells are explicitly decoupled, operate in nearly independent transition manners, and ultimately form unified representations of the complex environment. Concretely, besides the original memory cell of LSTM, this network features a zigzag memory flow that propagates in both bottom-up and top-down directions across all layers, enabling the visual dynamics learned at different levels of the RNN to communicate. It also leverages a memory decoupling loss to keep the memory cells from learning redundant features. We further propose a new curriculum learning strategy that forces PredRNN to learn long-term dynamics from the context frames and can be generalized to most sequence-to-sequence models. Detailed ablation studies verify the effectiveness of each component. Our approach obtains highly competitive results on five datasets in both action-free and action-conditioned predictive learning scenarios.
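To make the decoupled pair of memory cells concrete, the following is a minimal PyTorch sketch of a single recurrent unit with a temporal memory C and a spatiotemporal memory M, plus the decoupling penalty on their increments. It is not the authors' code: the class name, tensor shapes, gate layout, and the shared 1x1 projection used by the loss are illustrative assumptions.

```python
# A hedged sketch (not the authors' implementation) of one recurrent unit
# with two decoupled memories and the memory decoupling loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel=5):
        super().__init__()
        pad = kernel // 2
        # 7 gate maps from the input, 4 from H_{t-1}, 3 from the zigzag memory M
        self.conv_x = nn.Conv2d(in_ch, 7 * hid_ch, kernel, padding=pad)
        self.conv_h = nn.Conv2d(hid_ch, 4 * hid_ch, kernel, padding=pad)
        self.conv_m = nn.Conv2d(hid_ch, 3 * hid_ch, kernel, padding=pad)
        self.conv_o = nn.Conv2d(2 * hid_ch, hid_ch, kernel, padding=pad)
        self.conv_last = nn.Conv2d(2 * hid_ch, hid_ch, 1)
        # shared 1x1 projection applied before measuring memory redundancy
        self.proj = nn.Conv2d(hid_ch, hid_ch, 1)

    def forward(self, x, h, c, m):
        xg, xi, xf, xgp, xip, xfp, xo = torch.chunk(self.conv_x(x), 7, dim=1)
        hg, hi, hf, ho = torch.chunk(self.conv_h(h), 4, dim=1)
        mg, mi, mf = torch.chunk(self.conv_m(m), 3, dim=1)

        # temporal memory C_t: the original LSTM transition over time
        dc = torch.sigmoid(xi + hi) * torch.tanh(xg + hg)
        c_new = torch.sigmoid(xf + hf) * c + dc
        # spatiotemporal memory M_t: updated from the layer below (zigzag flow)
        dm = torch.sigmoid(xip + mi) * torch.tanh(xgp + mg)
        m_new = torch.sigmoid(xfp + mf) * m + dm

        mem = torch.cat([c_new, m_new], dim=1)
        o = torch.sigmoid(xo + ho + self.conv_o(mem))
        h_new = o * torch.tanh(self.conv_last(mem))

        # decoupling loss: penalize the absolute per-channel cosine similarity
        # between the two memory increments so C and M stay non-redundant
        pc, pm = self.proj(dc), self.proj(dm)
        cos = F.cosine_similarity(pc.flatten(2), pm.flatten(2), dim=2)
        return h_new, c_new, m_new, cos.abs().mean()
```

In a full stack of such units, M output by the top layer at step t would be fed to the bottom layer at step t+1, producing the zigzag memory flow described above, while the per-unit decoupling terms are summed into an auxiliary training loss.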
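The curriculum learning strategy mentioned above is, in the paper, a reverse scheduled sampling scheme: the probability of feeding the model the true context frame starts low and rises during training, while teacher forcing on the forecast frames decays as in standard scheduled sampling. The sketch below illustrates that idea only; the linear schedules, constants, and the `predict_step` callable are assumptions, not the paper's exact settings.

```python
# A hedged sketch of reverse scheduled sampling on context frames combined
# with standard scheduled sampling on forecast frames.
import random

def rss_prob(iteration, start=0.5, end=1.0, total=50_000):
    """Linearly increase the chance of using the TRUE context frame."""
    return min(end, start + (end - start) * iteration / total)

def ss_prob(iteration, total=50_000):
    """Linearly decay teacher forcing on the forecast frames."""
    return max(0.0, 1.0 - iteration / total)

def choose_inputs(frames, n_context, iteration, predict_step):
    """Yield the model input at each timestep: true frame or own prediction.

    `predict_step(x)` is a hypothetical single forward step of the model.
    """
    x = frames[0]  # the first frame is always the real observation
    for t in range(1, len(frames)):
        pred = predict_step(x)
        if t < n_context:  # encoding phase: reverse schedule (rising)
            use_true = random.random() < rss_prob(iteration)
        else:              # forecasting phase: standard schedule (decaying)
            use_true = random.random() < ss_prob(iteration)
        x = frames[t] if use_true else pred
        yield x
```

Forcing the network to occasionally predict even within the context window is what pushes it to extract long-term dynamics from the historical frames rather than merely copying them forward.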