Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework of spatiotemporal predictive learning, in which the spatial encoder and decoder capture intra-frame features and the middle temporal module captures inter-frame correlations. While mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency because their architectures cannot be parallelized. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame static attention and inter-frame dynamic attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization to take inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks.
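To make the role of the regularizer concrete, the following is a minimal, dependency-free sketch of one plausible reading of a differential divergence term: differences between consecutive frames are normalized with a softmax, and the predicted and ground-truth difference distributions are compared with a Kullback-Leibler divergence. The abstract does not specify these details, so the softmax normalization, the direction of the divergence, the temperature `tau`, the weight `alpha`, and all function names are assumptions made for illustration. Frames are represented as flattened lists of floats for simplicity.

```python
import math


def frame_differences(frames):
    """Element-wise differences between consecutive frames (T frames -> T-1 differences)."""
    return [[b - a for a, b in zip(prev, nxt)] for prev, nxt in zip(frames, frames[1:])]


def softmax(values, tau=1.0):
    """Numerically stable softmax with temperature tau (tau is an assumed hyperparameter)."""
    scaled = [v / tau for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def differential_divergence(pred, target, tau=1.0):
    """Mean KL divergence between softmax-normalized frame differences (assumed form)."""
    pred_diffs = frame_differences(pred)
    target_diffs = frame_differences(target)
    if not pred_diffs:  # fewer than two frames: no inter-frame variation to compare
        return 0.0
    terms = [
        kl_divergence(softmax(dp, tau), softmax(dt, tau))
        for dp, dt in zip(pred_diffs, target_diffs)
    ]
    return sum(terms) / len(terms)


def mse(pred, target):
    """Mean squared error over all pixels of all frames."""
    count = sum(len(frame) for frame in pred)
    squared = sum(
        (p - t) ** 2 for fp, ft in zip(pred, target) for p, t in zip(fp, ft)
    )
    return squared / count


def total_loss(pred, target, alpha=0.1, tau=1.0):
    """MSE plus the weighted regularizer; alpha is an assumed weighting."""
    return mse(pred, target) + alpha * differential_divergence(pred, target, tau)


# A prediction shifted by a constant offset has the same frame-to-frame changes
# as the target, so only the MSE term responds to it.
target = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
shifted = [[v + 0.5 for v in frame] for frame in target]
print(mse(shifted, target), differential_divergence(shifted, target))
```

Under these assumptions, the two terms are complementary: the MSE penalizes per-frame intensity errors, while the divergence term responds only when the predicted frame-to-frame changes differ from the true ones.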