Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework for spatiotemporal predictive learning, in which a spatial encoder and decoder capture intra-frame features and a temporal module between them captures inter-frame correlations. Mainstream methods employ recurrent units to capture long-term temporal dependencies, but their sequential architectures cannot be parallelized and are therefore computationally inefficient. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame static attention and inter-frame dynamic attention. Moreover, because the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization that takes inter-frame variations into account. Extensive experiments demonstrate that the resulting model achieves competitive performance on various spatiotemporal prediction benchmarks.
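To make the contrast between an intra-frame loss and an inter-frame regularizer concrete, the following is a minimal sketch in plain Python. The abstract does not give the form of the differential divergence regularization, so the form used here is an assumption: the differences between consecutive frames are turned into probability distributions with a temperature softmax, and the Kullback-Leibler divergence between the predicted and ground-truth distributions is used as the penalty. The function names, the weight `alpha`, and the temperature `tau` are all hypothetical and chosen for illustration only.

```python
import math


def softmax(values, tau=1.0):
    """Numerically stable softmax with temperature tau."""
    scaled = [v / tau for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def frame_differences(frames):
    """Flatten the differences between consecutive frames.

    frames: list of T frames, each a flat list of pixel values.
    Returns a flat list with (T - 1) * pixels_per_frame entries.
    """
    diffs = []
    for prev, curr in zip(frames, frames[1:]):
        diffs.extend(c - p for p, c in zip(prev, curr))
    return diffs


def mse(pred, target):
    """Mean squared error over all pixels of all frames (intra-frame term)."""
    count = sum(len(frame) for frame in target)
    total = sum(
        (p - t) ** 2
        for pred_frame, target_frame in zip(pred, target)
        for p, t in zip(pred_frame, target_frame)
    )
    return total / count


def differential_divergence(pred, target, tau=0.1, eps=1e-12):
    """Assumed inter-frame term: KL(softmax(d_pred / tau) || softmax(d_target / tau)),
    where d_* are the flattened consecutive-frame differences."""
    p = softmax(frame_differences(pred), tau)
    q = softmax(frame_differences(target), tau)
    return sum(
        pi * (math.log(pi) - math.log(max(qi, eps)))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def total_loss(pred, target, alpha=0.1, tau=0.1):
    """Hypothetical combined objective: MSE plus a weighted inter-frame penalty."""
    return mse(pred, target) + alpha * differential_divergence(pred, target, tau)


# Two frames of two pixels each.
target = [[0.0, 0.25], [0.5, 0.75]]
shifted = [[0.5, 0.75], [1.0, 1.25]]       # same motion, constant brightness offset
wrong_motion = [[0.0, 0.25], [0.75, 0.5]]  # same first frame, different change

print(mse(shifted, target), differential_divergence(shifted, target))
print(mse(wrong_motion, target), differential_divergence(wrong_motion, target))
```

In this sketch the `shifted` prediction has a large pixel-wise error but exactly the right frame-to-frame change, so only the MSE term penalizes it. The `wrong_motion` prediction is penalized by both terms, which illustrates why a difference-based regularizer supplies information that a per-frame error does not.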