Although recurrent neural network (RNN) based video prediction methods have achieved significant progress, their performance on high-resolution datasets remains far from satisfactory because of the information loss problem and the perception-insensitive mean squared error (MSE) based loss functions. In this paper, we propose a Spatiotemporal Information-Preserving and Perception-Augmented Model (STIP) to solve these two problems. To address information loss, the proposed model preserves the spatiotemporal information of videos during feature extraction and during state transitions, respectively. First, a Multi-Grained Spatiotemporal Auto-Encoder (MGST-AE) is designed based on the X-Net structure. The proposed MGST-AE helps the decoders recall multi-grained information from the encoders in both the temporal and spatial domains, so that more spatiotemporal information is preserved during feature extraction for high-resolution videos. Second, a Spatiotemporal Gated Recurrent Unit (STGRU) is designed based on the standard Gated Recurrent Unit (GRU) structure, which efficiently preserves spatiotemporal information during the state transitions. The proposed STGRU achieves more satisfactory performance with a much lower computational load than the popular Long Short-Term Memory (LSTM) based predictive memories. Furthermore, to improve on traditional MSE loss functions, a Learned Perceptual Loss (LP-loss) is designed based on Generative Adversarial Networks (GANs), which helps obtain a satisfactory trade-off between objective quality and perceptual quality. Experimental results show that the proposed STIP predicts videos with more satisfactory visual quality than a variety of state-of-the-art methods. Source code is available at \url{https://github.com/ZhengChang467/STIPHR}.
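For reference, the standard GRU state transition that STGRU builds on can be sketched as follows. This is a minimal NumPy sketch of the vanilla GRU update equations only; the spatiotemporal extensions that STIP adds (and any convolutional gating over feature maps) are not shown, and all parameter names and shapes here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, params):
    """One standard GRU state transition: h_t = (1 - z) * h + z * h_tilde."""
    Wz, Uz, Wr, Ur, Wh, Uh = params
    z = sigmoid(x @ Wz + h @ Uz)               # update gate
    r = sigmoid(x @ Wr + h @ Ur)               # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)   # candidate state
    return (1.0 - z) * h + z * h_tilde

# Illustrative dimensions and random weights (hypothetical, for the sketch only).
rng = np.random.default_rng(0)
d_in, d_h = 8, 16
params = [rng.standard_normal(s) * 0.1
          for s in [(d_in, d_h), (d_h, d_h)] * 3]
x = rng.standard_normal((4, d_in))   # batch of 4 input feature vectors
h = np.zeros((4, d_h))               # initial hidden state
h_next = gru_step(x, h, params)
print(h_next.shape)                  # (4, 16)
```

The gating structure is why GRU-based memories are cheaper than LSTM-based ones: a GRU keeps a single hidden state and two gates, whereas an LSTM maintains an additional cell state and a third gate.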