A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
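The greedy, modular training scheme the abstract describes can be caricatured in a few lines: each level of the hierarchy is trained to convergence on its own, then frozen, and the next level is fit on the frozen encoder's outputs, so only one module's activations and gradients occupy accelerator memory at a time. The sketch below is purely illustrative, using toy linear autoencoder stages in place of the paper's variational modules; all class and function names are hypothetical, not from the GHVAE codebase.

```python
import numpy as np

rng = np.random.default_rng(0)


class LatentModule:
    """One level of the hierarchy: a toy linear autoencoder stage
    (stands in for a variational encoder/decoder pair)."""

    def __init__(self, in_dim, latent_dim):
        self.enc = rng.normal(scale=0.1, size=(latent_dim, in_dim))
        self.dec = rng.normal(scale=0.1, size=(in_dim, latent_dim))
        self.frozen = False

    def encode(self, x):
        return self.enc @ x

    def decode(self, z):
        return self.dec @ z

    def train_step(self, x, lr=1e-2):
        # One gradient step on the reconstruction loss ||dec(enc(x)) - x||^2.
        z = self.encode(x)
        err = self.decode(z) - x
        grad_dec = np.outer(err, z)            # dL/d(dec), up to a constant
        grad_enc = np.outer(self.dec.T @ err, x)  # dL/d(enc), via chain rule
        self.dec -= lr * grad_dec
        self.enc -= lr * grad_enc
        return float(err @ err)


def greedy_train(dims, data, steps=200):
    """Train modules one at a time. Earlier modules are frozen, so at any
    point only the current module needs activations/gradients in memory."""
    modules, inputs = [], list(data)
    for in_dim, latent_dim in dims:
        module = LatentModule(in_dim, latent_dim)
        for _ in range(steps):
            for x in inputs:
                module.train_step(x)
        module.frozen = True
        modules.append(module)
        # The next level is fit on the frozen encoder's latent codes.
        inputs = [module.encode(x) for x in inputs]
    return modules
```

Adding a deeper level never perturbs the already-trained ones, which is the mechanism behind the abstract's claim that performance can be improved monotonically by adding modules.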