Autonomous systems not only need to understand their current environment, but should also be able to predict future actions conditioned on past states, for instance based on captured camera frames. However, existing models mainly focus on forecasting future video frames over short time horizons, and are hence of limited use for long-term action planning. We propose Multi-Scale Hierarchical Prediction (MSPred), a novel video prediction model able to simultaneously forecast future possible outcomes of different levels of granularity at different spatio-temporal scales. By combining spatial and temporal downsampling, MSPred efficiently predicts abstract representations, such as human poses or locations, over long time horizons, while still maintaining competitive performance for video frame prediction. In our experiments, we demonstrate that MSPred accurately predicts future video frames as well as high-level representations (e.g., keypoints or semantics) on bin-picking and action recognition datasets, while consistently outperforming popular approaches for future frame prediction. Furthermore, we ablate different modules and design choices in MSPred, experimentally validating that combining features of different spatial and temporal granularity leads to superior performance. Code and models to reproduce our experiments can be found at https://github.com/AIS-Bonn/MSPred.