It has been a long-standing dream to design artificial agents that explore their environment efficiently via intrinsic motivation, similar to how children perform curious free play. Despite recent advances in intrinsically motivated reinforcement learning (RL), sample-efficient exploration in object manipulation scenarios remains a significant challenge, as most of the relevant information lies in the sparse agent-object and object-object interactions. In this paper, we propose to use structured world models to incorporate relational inductive biases in the control loop, achieving sample-efficient and interaction-rich exploration in compositional multi-object environments. By planning for future novelty inside structured world models, our method generates free-play behavior that starts to interact with objects early on and develops more complex behavior over time. Instead of using models only to compute intrinsic rewards, as is commonly done, we show that the self-reinforcing cycle between good models and good exploration also opens up another avenue: zero-shot generalization to downstream tasks via model-based planning. After the entirely intrinsic, task-agnostic exploration phase, our method solves challenging downstream tasks such as stacking, flipping, pick & place, and throwing, generalizing to unseen numbers and arrangements of objects without any additional training.
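To make the "planning for future novelty" idea concrete, the following is a minimal sketch of one common instantiation: an ensemble of world models whose prediction disagreement serves as an intrinsic novelty reward, maximized by a CEM-style model-predictive planner. All names (EnsembleWorldModel, intrinsic_reward, plan_for_novelty) are hypothetical, and the random linear dynamics merely stand in for learned structured (e.g., graph-network) models; this is an illustration under those assumptions, not the paper's implementation.

```python
import numpy as np

# Hypothetical ensemble of world models: each member maps (state, action) -> next state.
# Randomly initialized linear residual dynamics stand in for trained structured models.
class EnsembleWorldModel:
    def __init__(self, n_models, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_s = rng.normal(scale=0.1, size=(n_models, state_dim, state_dim))
        self.W_a = rng.normal(scale=0.1, size=(n_models, action_dim, state_dim))

    def predict(self, states, action):
        # states: (n_models, state_dim), action: (action_dim,)
        # Per-member residual prediction: s' = s + s @ W_s[m] + a @ W_a[m]
        delta = (np.einsum('ms,msd->md', states, self.W_s)
                 + np.einsum('a,mad->md', action, self.W_a))
        return states + delta


def intrinsic_reward(predictions):
    # Novelty proxy: disagreement (variance) across ensemble members' predictions.
    return predictions.var(axis=0).sum()


def plan_for_novelty(model, state, horizon=10, n_candidates=64, n_iters=3,
                     elite_frac=0.1, action_dim=4, seed=0):
    """CEM-style planner that picks actions maximizing predicted future novelty."""
    rng = np.random.default_rng(seed)
    n_models = model.W_s.shape[0]
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(elite_frac * n_candidates))
    for _ in range(n_iters):
        candidates = rng.normal(mean, std, size=(n_candidates, horizon, action_dim))
        returns = np.zeros(n_candidates)
        for i, actions in enumerate(candidates):
            # Roll out each ensemble member in parallel from the current state.
            s = np.repeat(state[None], n_models, axis=0)
            for a in actions:
                s = model.predict(s, a)
                returns[i] += intrinsic_reward(s)
        elite = candidates[np.argsort(returns)[-n_elite:]]
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
    return mean[0]  # execute only the first action (MPC-style), then replan


# Usage: query one novelty-seeking action from an (untrained) ensemble.
state_dim, action_dim = 8, 4
model = EnsembleWorldModel(n_models=5, state_dim=state_dim, action_dim=action_dim)
action = plan_for_novelty(model, np.zeros(state_dim), action_dim=action_dim)
print("novelty-seeking action:", action)
```

Because the same planner can maximize an extrinsic task reward inside the learned models instead of the disagreement signal, swapping the objective is what enables the zero-shot transfer to downstream tasks described above.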