Building open-ended agents that can autonomously discover a diversity of behaviours is one of the long-standing goals of artificial intelligence. This challenge can be studied in the framework of autotelic RL agents, i.e. agents that learn by selecting and pursuing their own goals, self-organizing a learning curriculum. Recent work identified language as a key dimension of autotelic learning, in particular because it enables abstract goal sampling and guidance from social peers for hindsight relabelling. Within this perspective, we study the following open scientific questions: What is the impact of hindsight feedback from a social peer (e.g. selective vs. exhaustive)? How can the agent learn from very rare language goal examples in its experience replay? How can multiple forms of exploration be combined, taking advantage of easier goals as stepping stones to reach harder ones? To address these questions, we use ScienceWorld, a textual environment with rich abstract and combinatorial physics. We show the importance of selectivity in the social peer's feedback; that experience replay needs to over-sample examples of rare goals; and that following self-generated goal sequences where the agent's competence is intermediate leads to significant improvements in final performance.
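Two of the mechanisms summarized above lend themselves to a compact illustration. The Python sketch below is purely illustrative and not the paper's implementation: a hypothetical `GoalReplayBuffer` over-samples episodes attached to rare language goals via inverse-frequency weights, and a hypothetical `sample_goal` helper prefers goals where the agent's measured competence is intermediate; all names and thresholds (`low=0.2`, `high=0.8`) are assumptions, not values from the paper.

```python
import random
from collections import defaultdict


class GoalReplayBuffer:
    """Illustrative replay buffer that over-samples rare language goals.

    Sketch only: each episode is stored under its (hindsight-relabelled)
    goal description, and the sampling weight of a goal is inversely
    proportional to how many episodes mention it, so rare goals are
    replayed more often than their raw frequency would suggest.
    """

    def __init__(self):
        self.episodes_by_goal = defaultdict(list)

    def add(self, goal: str, episode) -> None:
        self.episodes_by_goal[goal].append(episode)

    def sample(self, batch_size: int):
        goals = list(self.episodes_by_goal)
        # Inverse-frequency weights: goals with few stored episodes
        # get proportionally higher probability of being drawn.
        weights = [1.0 / len(self.episodes_by_goal[g]) for g in goals]
        batch = []
        for _ in range(batch_size):
            g = random.choices(goals, weights=weights, k=1)[0]
            batch.append(random.choice(self.episodes_by_goal[g]))
        return batch


def sample_goal(competence: dict, low: float = 0.2, high: float = 0.8):
    """Prefer goals of intermediate competence (hypothetical thresholds).

    `competence` maps each goal to an estimated success rate in [0, 1];
    goals that are neither mastered nor out of reach form the frontier
    from which the next goal is drawn, falling back to all goals if the
    frontier is empty.
    """
    frontier = [g for g, c in competence.items() if low <= c <= high]
    return random.choice(frontier or list(competence))
```

A usage note on the design: keying the buffer by goal string keeps the inverse-frequency computation O(1) per goal at sampling time, and the intermediate-competence filter is one simple stand-in for the idea of using easier goals as stepping stones toward harder ones.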