A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex, novel images that exhibit combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Specifically, we cast the sequential decision-making problem as a text-conditioned video generation problem: given a text-encoded specification of a desired goal, a planner synthesizes a set of future frames depicting its planned actions, after which control actions are extracted from the generated video. By leveraging text as the underlying goal specification, we are able to naturally and combinatorially generalize to novel goals. The proposed policy-as-video formulation can further represent environments with different state and action spaces in a unified space of images, which, for example, enables learning and generalization across a variety of robot manipulation tasks. Finally, by leveraging pretrained language embeddings and widely available videos from the internet, the approach enables knowledge transfer by predicting highly realistic video plans for real robots.
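To make the policy-as-video formulation concrete, the following minimal Python sketch outlines the control loop it implies: a text-conditioned video model synthesizes future frames from the current image observation and the goal text, and a control action is then extracted from each pair of consecutive frames (we assume an inverse-dynamics model for this extraction step, a common choice; the abstract only states that actions are extracted from the generated video). All names here (`VideoDiffusionPlanner`, `InverseDynamicsModel`, `act_from_video_plan`) are hypothetical placeholders rather than the paper's actual components, and the environment is assumed to expose a gym-style `reset`/`step` interface with image observations.

```python
from typing import List

import numpy as np


class VideoDiffusionPlanner:
    """Hypothetical text-conditioned video generator (e.g., a diffusion model)."""

    def generate(self, first_frame: np.ndarray, goal_text: str,
                 horizon: int) -> List[np.ndarray]:
        """Synthesize `horizon` future frames depicting the planned behavior."""
        raise NotImplementedError


class InverseDynamicsModel:
    """Hypothetical model mapping two adjacent frames to the connecting action."""

    def predict_action(self, frame: np.ndarray,
                       next_frame: np.ndarray) -> np.ndarray:
        raise NotImplementedError


def act_from_video_plan(env, planner: VideoDiffusionPlanner,
                        inv_dyn: InverseDynamicsModel,
                        goal_text: str, horizon: int = 16):
    """Generate a video plan for `goal_text`, then execute extracted actions.

    `env` is assumed to be gym-style, with image frames as observations.
    """
    obs = env.reset()                                 # current frame
    plan = planner.generate(obs, goal_text, horizon)  # synthesized future frames
    frames = [obs] + plan
    for t in range(len(frames) - 1):
        # Recover the control action that transitions frame t to frame t+1.
        action = inv_dyn.predict_action(frames[t], frames[t + 1])
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```

Because the plan lives entirely in the space of images and text, the same loop applies unchanged across environments with different underlying state and action spaces; only the (hypothetical) inverse-dynamics model is environment-specific.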