Subgoal-based Reward Shaping to Improve Efficiency in Reinforcement Learning

From arXiv. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible. arXiv admin note: substantial text overlap with arXiv:2104.06163.

Reinforcement learning, which acquires a policy that maximizes long-term rewards, has been actively studied. Unfortunately, it is often too slow and difficult to use in practical situations because the state-action space becomes huge in real environments. Many studies have therefore incorporated human knowledge into reinforcement learning. Human knowledge of trajectories is often used, but providing it can require a human to control an AI agent, which can be difficult. Knowledge of subgoals may lessen this requirement because humans need only consider a few representative states on an optimal trajectory in their minds. The essential factor for learning efficiency is rewards, and potential-based reward shaping is a basic method for enriching them. However, it is often difficult to incorporate subgoals into potential-based reward shaping to accelerate learning, because the appropriate potentials are not intuitive for humans. We extend potential-based reward shaping and propose subgoal-based reward shaping, which makes it easier for human trainers to share their knowledge of subgoals. To evaluate our method, we obtained subgoal series from participants and conducted experiments in three domains: four-rooms (discrete states and discrete actions), pinball (continuous states and discrete actions), and picking (continuous states and continuous actions). We compared our method with a baseline reinforcement learning algorithm and with other subgoal-based methods, including random subgoals and naive subgoal-based reward shaping. We found that our reward shaping outperformed all other methods in learning efficiency.
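As background for the abstract, the standard potential-based shaping term can be sketched in a few lines of Python. This is not the paper's subgoal-based method, which the abstract does not spell out. It is only the classic form F(s, s') = γΦ(s') − Φ(s) that the paper extends. The state names, the potential values in `SUBGOAL_POTENTIAL`, and the helper names are all hypothetical and chosen for illustration.

```python
from typing import Callable, Hashable, Sequence

State = Hashable


def shaping_term(potential: Callable[[State], float],
                 state: State, next_state: State, gamma: float) -> float:
    """Classic potential-based shaping term: gamma * phi(s') - phi(s)."""
    return gamma * potential(next_state) - potential(state)


def discounted_shaping_sum(potential: Callable[[State], float],
                           trajectory: Sequence[State], gamma: float) -> float:
    """Sum of discounted shaping terms along a trajectory of states."""
    total = 0.0
    for t in range(len(trajectory) - 1):
        total += gamma ** t * shaping_term(
            potential, trajectory[t], trajectory[t + 1], gamma)
    return total


# Hypothetical hand-written potentials (an assumption, not from the paper):
# states a trainer regards as subgoals get larger values.
SUBGOAL_POTENTIAL = {"start": 0.0, "door": 1.0, "hall": 0.0, "goal": 2.0}


def phi(state: State) -> float:
    """Look up the hypothetical potential; unknown states default to 0."""
    return SUBGOAL_POTENTIAL.get(state, 0.0)


if __name__ == "__main__":
    gamma = 0.9
    trajectory = ["start", "door", "hall", "goal"]
    total = discounted_shaping_sum(phi, trajectory, gamma)
    # The terms telescope, so the sum depends only on the end states.
    endpoints = gamma ** 3 * phi("goal") - phi("start")
    print(total, endpoints)
```

The discounted shaping terms telescope to γ^T Φ(s_T) − Φ(s_0), so the shaping bonus depends only on the first and last states of a trajectory. The sketch also shows the difficulty the abstract describes: the trainer has to choose a numeric potential for every state, which is less intuitive than simply naming a few subgoal states.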
