Video prediction is a complex time-series forecasting task with great potential in many use cases. However, conventional methods overemphasize accuracy at the expense of speed: their complicated model structures learn large amounts of redundant information, which slows prediction and consumes excessive GPU memory. Furthermore, conventional methods mostly predict frames sequentially (frame-by-frame) and are therefore hard to accelerate. Consequently, valuable applications such as real-time danger prediction and warning cannot reach the inference speed required for practical deployment. We therefore propose TKN, a transformer-based keypoint prediction neural network: an unsupervised learning method that accelerates the prediction process via constrained information extraction and a parallel prediction scheme. To the best of our knowledge, TKN is the first real-time video prediction solution; it significantly reduces computation cost while maintaining prediction quality. Extensive experiments on the KTH and Human3.6 datasets demonstrate that TKN predicts 11 times faster than existing methods, while reducing memory consumption by 17.4% and achieving state-of-the-art prediction performance on average.
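To illustrate the parallel prediction scheme described above, the following is a minimal sketch (not the authors' code) of the core idea: a transformer consumes the keypoints of all observed frames and emits the keypoints of every future frame in a single forward pass, rather than rolling frames out autoregressively. All module names, dimensions, and the use of learned per-frame query embeddings are illustrative assumptions.

```python
# Hypothetical sketch of parallel keypoint prediction with a transformer.
# Dimensions and architecture choices are assumptions, not the paper's spec.
import torch
import torch.nn as nn

class ParallelKeypointPredictor(nn.Module):
    def __init__(self, num_keypoints=16, dim=128, num_future=10):
        super().__init__()
        self.num_future = num_future
        # Embed each frame's flattened (x, y) keypoint coordinates.
        self.embed = nn.Linear(num_keypoints * 2, dim)
        # One learned query per future frame lets the decoder produce
        # all future frames in parallel (no frame-by-frame loop).
        self.queries = nn.Parameter(torch.randn(num_future, dim))
        self.transformer = nn.Transformer(
            d_model=dim, nhead=4, num_encoder_layers=2,
            num_decoder_layers=2, batch_first=True,
        )
        self.head = nn.Linear(dim, num_keypoints * 2)

    def forward(self, past_kpts):
        # past_kpts: (batch, observed_frames, num_keypoints * 2)
        memory = self.embed(past_kpts)
        queries = self.queries.unsqueeze(0).expand(past_kpts.size(0), -1, -1)
        out = self.transformer(memory, queries)  # (batch, num_future, dim)
        return self.head(out)  # keypoints for all future frames at once

# Usage: predict 10 future keypoint sets from 5 observed frames.
model = ParallelKeypointPredictor()
past = torch.randn(2, 5, 32)  # batch of 2, 5 frames, 16 keypoints x 2 coords
future = model(past)          # shape (2, 10, 32), produced in one pass
print(future.shape)
```

Because all future frames are decoded from independent queries in one pass, inference latency no longer grows with the length of the autoregressive rollout, which is the property that makes this scheme amenable to real-time use.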