Reward design is a critical part of applying reinforcement learning, and its success strongly depends on how well the reward signal frames the designer's goal and how well it assesses progress toward that goal. In many cases, the extrinsic rewards provided by the environment (e.g., the win or loss of a game) are very sparse, which makes it difficult to train agents directly. In practice, researchers usually assist agents' learning by adding auxiliary rewards. However, designing auxiliary rewards often devolves into a trial-and-error search for reward settings that produce acceptable results. In this paper, we propose to automatically generate goal-consistent intrinsic rewards for the agent to learn, such that maximizing them also maximizes the expected cumulative extrinsic reward. To this end, we introduce the concept of motivation, which captures the underlying goal of maximizing certain rewards, and propose a motivation-based reward design method. The basic idea is to shape the intrinsic rewards by minimizing the distance between the intrinsic and extrinsic motivations. We conduct extensive experiments and show that our method outperforms state-of-the-art methods in handling problems of delayed reward, exploration, and credit assignment.
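The abstract does not spell out how motivation is formalized. The following minimal sketch is one hedged reading, assuming the motivation of a reward signal is its per-timestep discounted return along sampled trajectories and the distance between motivations is a squared error; the tabular setting, the `discounted_returns` helper, and all parameters here are hypothetical illustrations, not the paper's actual method.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Per-timestep discounted return G_t = sum_k gamma^k * r_{t+k}."""
    G = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        G[t] = running
    return G

rng = np.random.default_rng(0)
n_states, T, gamma, lr = 10, 50, 0.99, 0.01
r_intrinsic = np.zeros(n_states)  # learnable dense reward, one entry per state

for _ in range(500):
    states = rng.integers(0, n_states, size=T)      # toy random trajectory
    r_ext = (states == n_states - 1).astype(float)  # sparse extrinsic reward at a goal state
    G_ext = discounted_returns(r_ext, gamma)        # extrinsic "motivation" (assumed form)
    G_int = discounted_returns(r_intrinsic[states], gamma)  # intrinsic "motivation"

    # Gradient of 0.5 * ||G_int - G_ext||^2 w.r.t. the reward table:
    # G_int[t] depends on r_intrinsic[states[k]] with weight gamma^(k - t) for k >= t.
    delta = G_int - G_ext
    grad = np.zeros(n_states)
    for k in range(T):
        coeffs = gamma ** (k - np.arange(k + 1))  # weights gamma^(k - t) for t = 0..k
        grad[states[k]] += np.dot(coeffs, delta[: k + 1])
    r_intrinsic -= lr * grad / T

print(np.round(r_intrinsic, 3))  # learned dense reward tends to concentrate on the goal state
```

Under this reading, fitting the intrinsic returns to the sparse extrinsic returns densifies the reward signal while keeping it consistent with the original goal, which is the alignment property the abstract claims.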