Learning to solve sparse-reward reinforcement learning problems is difficult due to the lack of guidance towards the goal, but in some problems prior knowledge can be used to aid the learning process. Reward shaping is a way to incorporate prior knowledge into the original reward function in order to speed up learning. While previous work has investigated the use of expert knowledge to generate potential functions, in this work we study whether a search algorithm (A*) can be used to automatically generate a potential function for reward shaping in Sokoban, a well-known planning task. Our results show that learning with the shaped reward function is faster than learning from scratch, and they indicate that distance functions are a suitable choice of potential function for Sokoban. This work demonstrates the possibility of solving multiple instances with the help of reward shaping. The resulting policies can be compressed into a single policy, which can be seen as a first phase towards training a general policy that is able to solve unseen instances.
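As a point of reference, the following is a minimal sketch of potential-based reward shaping (Ng et al., 1999) with a distance-derived potential, which is one way to read "use a search algorithm (A*) to generate a potential function". It is not the paper's implementation; the names `astar_distance_to_goal` and `gamma` are assumptions introduced only for illustration.

```python
def shaped_reward(reward, state, next_state, astar_distance_to_goal, gamma=0.99):
    """Return r + F(s, s'), where F(s, s') = gamma * Phi(s') - Phi(s).

    Here Phi(s) is taken to be the negative A* distance from s to the goal
    (a hypothetical `astar_distance_to_goal` callable), so states closer to
    the goal have higher potential and the agent is nudged toward them,
    while the optimal policy of the original task is preserved.
    """
    phi_s = -astar_distance_to_goal(state)       # potential of current state
    phi_next = -astar_distance_to_goal(next_state)  # potential of next state
    return reward + gamma * phi_next - phi_s
```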