Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e., agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language as a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g., selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, using easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show the importance of selectivity in the social peer's feedback; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences where the agent's competence is intermediate leads to significant improvements in final performance.
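The two replay and goal-selection mechanisms summarized above can be made concrete with a minimal Python sketch. This is an illustrative assumption, not the paper's implementation: the class and function names (`RareGoalReplayBuffer`, `select_intermediate_goal`), the inverse-frequency weighting, and the competence thresholds are all hypothetical choices used here only to show the general idea of over-sampling rare language goals and preferring goals of intermediate competence.

```python
import random
from collections import defaultdict, deque


class RareGoalReplayBuffer:
    """Replay buffer that over-samples transitions for rarely seen language goals.

    Hypothetical sketch: `alpha` controls how aggressively rare goals are
    favoured (alpha=0 recovers uniform sampling over goals).
    """

    def __init__(self, capacity=100_000, alpha=1.0):
        self.buffers = defaultdict(lambda: deque(maxlen=capacity))
        self.counts = defaultdict(int)  # how often each goal has been stored
        self.alpha = alpha

    def add(self, goal, transition):
        self.buffers[goal].append(transition)
        self.counts[goal] += 1

    def sample(self, batch_size):
        goals = list(self.buffers)
        # Inverse-frequency weights: rarer goals get proportionally more draws.
        weights = [1.0 / (self.counts[g] ** self.alpha) for g in goals]
        chosen = random.choices(goals, weights=weights, k=batch_size)
        return [random.choice(self.buffers[g]) for g in chosen]


def select_intermediate_goal(competence, low=0.2, high=0.8):
    """Pick a self-generated goal whose estimated success rate is intermediate,
    so that partially mastered goals serve as stepping stones to harder ones.

    `competence` maps goal descriptions to estimated success rates in [0, 1];
    the [low, high] band is an assumed stand-in for a learned criterion.
    """
    candidates = [g for g, c in competence.items() if low <= c <= high]
    pool = candidates if candidates else list(competence)
    return random.choice(pool)
```

Under these assumptions, an autotelic training loop would call `select_intermediate_goal` to pick the next self-generated goal, and train on batches drawn from `RareGoalReplayBuffer.sample`, so that goals with few stored examples are not drowned out by frequent ones.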