Spatiotemporal predictive learning aims to generate future frames given a sequence of historical frames. Conventional algorithms are mostly based on recurrent neural networks (RNNs). However, RNNs suffer from a heavy computational burden, such as long training time, because the serial nature of the recurrent structure forces back-propagation through time. Recently, Transformer-based methods have also been investigated, in the form of either an encoder-decoder or a plain encoder, but the encoder-decoder form requires overly deep networks, while the plain encoder fails to capture short-term dependencies. To tackle these problems, we propose an algorithm named 3D temporal convolutional transformer (TCTN), in which a Transformer-based encoder with temporal convolutional layers is employed to capture both short-term and long-term dependencies. Thanks to the parallel mechanism of the Transformer, our proposed algorithm is easy to implement and trains much faster than RNN-based methods. To validate our algorithm, we conduct experiments on the Moving MNIST and KTH datasets, and show that TCTN outperforms state-of-the-art (SOTA) methods in both performance and training speed.
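To make the architecture concrete, below is a minimal PyTorch sketch of one encoder block in this spirit: a 3D convolution over (time, height, width) produces the queries, keys, and values, so each token first aggregates a local spatiotemporal neighborhood (short-term dependencies), and masked multi-head self-attention over the flattened space-time tokens then mixes information globally (long-term dependencies). This is an illustration under our own assumptions, not the paper's exact implementation: the class name `TemporalConvAttention`, the kernel size, the head count, and the causal masking scheme are all hypothetical, and the real TCTN design may combine convolution and attention differently.

```python
import torch
import torch.nn as nn


class TemporalConvAttention(nn.Module):
    """Sketch of an encoder block mixing 3D convolution with self-attention.

    The 3D convolution builds queries, keys, and values from a local
    spatiotemporal neighborhood (short-term dependencies); masked multi-head
    attention over all space-time tokens captures long-term dependencies.
    """

    def __init__(self, channels: int, num_heads: int = 4, kernel: int = 3):
        super().__init__()
        # 3D conv over (time, height, width); kernel size is an assumption.
        self.to_qkv = nn.Conv3d(channels, 3 * channels, kernel, padding=kernel // 2)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)

        def flat(z):  # (b, c, t, h, w) -> (b, t*h*w, c) token sequence
            return z.flatten(2).transpose(1, 2)

        # Causal mask: a token may not attend to tokens from future frames.
        step = torch.arange(t * h * w, device=x.device) // (h * w)
        mask = step.unsqueeze(0) > step.unsqueeze(1)  # True = attention blocked
        out, _ = self.attn(flat(q), flat(k), flat(v), attn_mask=mask)
        out = self.norm(out + flat(x))  # residual connection + layer norm
        return out.transpose(1, 2).reshape(b, c, t, h, w)


# Toy usage: a batch of two 10-frame, 16x16, 64-channel feature sequences.
block = TemporalConvAttention(channels=64)
frames = torch.randn(2, 64, 10, 16, 16)
print(block(frames).shape)  # torch.Size([2, 64, 10, 16, 16])
```

The design point this sketch tries to capture is that the convolutional path gives every token a local receptive field in time and space before attention mixes tokens globally, yielding short-term and long-term dependencies in a single block that, unlike an RNN, processes all frames in parallel.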