Reinforcement learning (RL) in long-horizon, sparse-reward tasks is notoriously difficult and requires many training steps. A standard way to speed up learning is to leverage additional reward signals, shaping the reward to better guide the learning process. In the context of language-conditioned RL, the abstraction and generalisation properties of the language input provide opportunities for more efficient ways of shaping the reward. In this paper, we leverage this idea and propose an automated reward-shaping method in which the agent extracts auxiliary objectives from the general language goal. These auxiliary objectives rely on a question generation (QG) and question answering (QA) system: they consist of questions that lead the agent to try to reconstruct partial information about the global goal using its own trajectory. When it succeeds, the agent receives an intrinsic reward proportional to its confidence in its answer. This incentivizes the agent to generate trajectories that unambiguously explain various aspects of the general language goal. Our experimental study shows that this approach, which requires no engineering effort to design the auxiliary objectives, improves sample efficiency by effectively directing exploration.
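To make the shaping mechanism concrete, here is a minimal sketch of the confidence-based intrinsic reward, assuming a QA system that returns a probability for the correct answer to each auto-generated question; the function name, the averaging over questions, and the `beta` scaling coefficient are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of QA-confidence-based reward shaping (illustrative;
# names and the exact reward formula are assumptions, not the paper's code).

def intrinsic_reward(qa_confidences, beta=1.0):
    """Shaping bonus from QG/QA auxiliary objectives.

    qa_confidences: probabilities the QA system assigns to the correct
        answer of each auto-generated question about the language goal,
        using the agent's own trajectory as evidence.
    beta: hypothetical scaling coefficient for the bonus.
    """
    if not qa_confidences:
        return 0.0
    # The bonus grows with the QA system's confidence in the correct
    # answers: trajectories that unambiguously explain aspects of the
    # goal earn a larger reward.
    return beta * sum(qa_confidences) / len(qa_confidences)


# Example: three auto-generated questions answered with varying confidence.
print(intrinsic_reward([0.9, 0.6, 0.75]))  # 0.75
```

In this sketch the bonus is added to the environment's sparse reward at each evaluation point, so trajectories whose evidence lets the QA system answer confidently are reinforced even before the global goal is reached.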