MIMO Is All You Need: A Strong Multi-In-Multi-Out Baseline for Video Prediction

Most existing approaches to video prediction build their models on a Single-In-Single-Out (SISO) architecture, which takes the current frame as input and predicts the next frame recursively. This design often suffers severe performance degradation when extrapolating further into the future, which limits the practical use of the prediction model. Alternatively, a Multi-In-Multi-Out (MIMO) architecture that outputs all future frames in one shot naturally breaks the recursion and therefore prevents error accumulation. However, only a few MIMO models for video prediction have been proposed, and to date they have achieved only inferior performance. The real strength of MIMO models in this area has gone largely unnoticed and remains under-explored. Motivated by this, we conduct a comprehensive investigation in this paper of how far a simple MIMO architecture can go. Surprisingly, our empirical studies reveal that a simple MIMO model can outperform state-of-the-art work by a much larger margin than expected, especially in dealing with long-term error accumulation. After exploring a number of design choices, we propose a new MIMO architecture, namely MIMO-VP, which extends the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks (Moving MNIST, Human3.6M, Weather, KITTI). Extensive experiments show that our model ranks first on all benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects, including efficiency, quantitative results, and qualitative results. We believe our model can serve as a new baseline to facilitate future research on video prediction tasks. The code will be released.
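The contrast between the two inference schemes can be sketched in a few lines of Python. This is a toy illustration only, not the MIMO-VP model: a "frame" is reduced to a single float, the true dynamics are assumed to add 1.0 per step, and every model call is assumed to add the same fixed error `EPS`. The names `siso_rollout`, `mimo_predict`, `toy_step`, `toy_multi`, and `EPS` are hypothetical and do not come from the paper.

```python
from typing import Callable, List

Frame = float  # toy stand-in for an image: one scalar per "frame" (assumption)
EPS = 0.1      # assumed fixed error added by every model call (assumption)


def siso_rollout(step: Callable[[Frame], Frame],
                 current: Frame, horizon: int) -> List[Frame]:
    """SISO: predict one frame at a time, feeding each prediction back in."""
    preds: List[Frame] = []
    frame = current
    for _ in range(horizon):
        frame = step(frame)  # the next input is the model's own output
        preds.append(frame)
    return preds


def mimo_predict(multi: Callable[[List[Frame], int], List[Frame]],
                 past: List[Frame], horizon: int) -> List[Frame]:
    """MIMO: a single call maps all past frames to all future frames."""
    preds = multi(past, horizon)
    if len(preds) != horizon:
        raise ValueError("model must return exactly `horizon` frames")
    return preds


def toy_step(frame: Frame) -> Frame:
    # Hypothetical one-step model: true dynamics (+1.0) plus a fixed error.
    return frame + 1.0 + EPS


def toy_multi(past: List[Frame], horizon: int) -> List[Frame]:
    # Hypothetical one-shot model: each output is offset by the same fixed
    # error, and no output depends on any other predicted frame.
    return [past[-1] + k + EPS for k in range(1, horizon + 1)]


def abs_errors(preds: List[Frame], last: Frame) -> List[float]:
    # Ground truth under the toy dynamics: frame k steps ahead is last + k.
    return [abs(p - (last + k)) for k, p in enumerate(preds, start=1)]


past_frames = [-2.0, -1.0, 0.0]
siso_err = abs_errors(siso_rollout(toy_step, past_frames[-1], 10), past_frames[-1])
mimo_err = abs_errors(mimo_predict(toy_multi, past_frames, 10), past_frames[-1])
```

Under these assumptions the SISO error grows with the prediction horizon, because each step inherits the error of the previous one, while the MIMO error stays at `EPS` for every future frame. The result holds by construction in this toy setting and only illustrates the mechanism described in the abstract; it is not evidence about real models.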
