Video prediction aims to predict future frames by modeling the complex spatiotemporal dynamics in videos. However, most existing methods model the temporal and spatial information of videos independently and have not fully explored the correlations between the two. In this paper, we propose a SpatioTemporal-Aware Unit (STAU) for video prediction and beyond, which exploits the significant spatiotemporal correlations in videos. On the one hand, motion-aware attention weights are learned from the spatial states to help aggregate the temporal states in the temporal domain. On the other hand, appearance-aware attention weights are learned from the temporal states to help aggregate the spatial states in the spatial domain. In this way, the temporal and spatial information become mutually aware in both domains, and the spatiotemporal receptive field is greatly broadened for more reliable spatiotemporal modeling. Experiments are conducted not only on traditional video prediction tasks but also on tasks beyond video prediction, including early action recognition and object detection. Experimental results show that our STAU outperforms other methods on all tasks in terms of both performance and computational efficiency.
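The cross-domain attention described above can be illustrated with a minimal sketch. This is not the paper's actual STAU formulation (the abstract does not give the equations); the function name `stau_step`, the use of dot-product attention, and the vector-valued states are all assumptions made purely for illustration. The point it shows is the symmetry: weights derived from the spatial state aggregate the past temporal states, and weights derived from the resulting temporal summary aggregate the past spatial states.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def stau_step(s_t, temporal_states, spatial_states):
    """Hypothetical single STAU-style update (illustrative only).

    s_t:             current spatial state, shape (d,)
    temporal_states: list of k past temporal states, each shape (d,)
    spatial_states:  list of k past spatial states, each shape (d,)
    """
    T = np.stack(temporal_states)  # (k, d)
    S = np.stack(spatial_states)   # (k, d)

    # Motion-aware attention: weights learned from the spatial state
    # aggregate the past temporal states (temporal domain).
    motion_w = softmax(T @ s_t)    # (k,), sums to 1
    t_agg = motion_w @ T           # (d,)

    # Appearance-aware attention: weights learned from the temporal
    # summary aggregate the past spatial states (spatial domain).
    appearance_w = softmax(S @ t_agg)  # (k,)
    s_agg = appearance_w @ S           # (d,)

    return t_agg, s_agg
```

In a real recurrent unit these states would be feature maps and the attention would be learned with trainable projections; the sketch keeps only the aggregation pattern, in which each domain's weights are conditioned on the other domain's state.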