The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement, where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior work in video prediction, is parameter efficient, and can generate high-resolution videos (256×256). Further, we demonstrate the benefits of inference speedup (up to 512×) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
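To make the mask-scheduled iterative decoding described above concrete, the following is a minimal sketch, assuming a MaskGIT-style cosine schedule and a hypothetical pretrained `model` that maps token ids to per-position logits; the function names, `mask_id`, and tensor shapes are illustrative assumptions, not the paper's released code.

```python
import math
import torch

def mask_schedule(progress: float) -> float:
    """Cosine mask scheduling function: the fraction of initially masked
    tokens that should remain masked at decoding progress in [0, 1]."""
    return math.cos(progress * math.pi / 2)

@torch.no_grad()
def iterative_decode(model, tokens, masked, num_steps=12, mask_id=0):
    """Iteratively refine masked video tokens.

    tokens: (B, N) long tensor of visual token ids; masked slots hold mask_id.
    masked: (B, N) bool tensor, True where a token is still unknown.
    model:  assumed callable mapping token ids -> logits of shape (B, N, V).
    """
    num_masked = int(masked[0].sum())  # same mask layout for every sample
    for step in range(1, num_steps + 1):
        logits = model(tokens)                    # (B, N, V)
        conf, pred = logits.softmax(-1).max(-1)   # confidence and argmax id
        # Fill every still-masked slot with its current prediction.
        tokens = torch.where(masked, pred, tokens)
        # The schedule decides how many tokens stay masked after this step.
        n_keep_masked = int(num_masked * mask_schedule(step / num_steps))
        if n_keep_masked == 0:                    # final step: all tokens fixed
            break
        # Re-mask the least confident predictions; committed tokens stay fixed.
        conf = conf.masked_fill(~masked, float("inf"))
        remask_idx = conf.topk(n_keep_masked, dim=-1, largest=False).indices
        masked = torch.zeros_like(masked).scatter(-1, remask_idx, True)
        tokens = tokens.masked_fill(masked, mask_id)
    return tokens
```

Because each step commits many tokens at once, `num_steps` forward passes replace one pass per generated token, which is where the inference speedup from iterative decoding comes from.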