Reinforcement learning algorithms require many samples when solving complex hierarchical tasks with sparse and delayed rewards. For such complex tasks, the recently proposed RUDDER uses reward redistribution to leverage steps in the Q-function that are associated with accomplishing sub-tasks. However, often only a few episodes with high rewards are available as demonstrations, since current exploration strategies cannot discover them in reasonable time. In this work, we introduce Align-RUDDER, which utilizes a profile model for reward redistribution that is obtained from multiple sequence alignment of demonstrations. Consequently, Align-RUDDER employs reward redistribution effectively and thereby drastically improves learning from few demonstrations. Align-RUDDER outperforms competitors on complex artificial tasks with delayed rewards and few demonstrations. On the Minecraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though not frequently. Code is available at https://github.com/ml-jku/align-rudder. YouTube: https://youtu.be/HO-_8ZUl-UY
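To make the core idea concrete, below is a minimal sketch (not the authors' implementation) of alignment-based reward redistribution. It assumes demonstrations have already been mapped to equal-length sequences of discrete events; the paper instead uses a proper multiple sequence alignment to handle demonstrations of different lengths, and the event names in the toy example are hypothetical.

```python
# Minimal sketch of profile-based reward redistribution, assuming
# demonstrations are equal-length sequences of discrete events
# (clusters of states/actions). Align-RUDDER itself builds the
# profile from a multiple sequence alignment of the demonstrations.

from collections import Counter


def build_profile(demos):
    """Column-wise event frequencies over equal-length demo sequences."""
    profile = []
    for t in range(len(demos[0])):
        counts = Counter(demo[t] for demo in demos)
        profile.append({e: c / len(demos) for e, c in counts.items()})
    return profile


def alignment_score(prefix, profile):
    """Score an episode prefix against the profile: sum of the relative
    frequencies of the observed events at each position."""
    return sum(profile[t].get(e, 0.0) for t, e in enumerate(prefix))


def redistribute_reward(episode, final_reward, profile):
    """Spread the delayed final reward over time steps in proportion to
    the increase in alignment score, following RUDDER's
    return-decomposition idea."""
    scores = [alignment_score(episode[: t + 1], profile)
              for t in range(len(episode))]
    deltas = [scores[0]] + [scores[t] - scores[t - 1]
                            for t in range(1, len(scores))]
    total = sum(deltas) or 1.0
    return [final_reward * d / total for d in deltas]


# Toy usage with hypothetical events: 'w'=wood, 's'=stone, 'i'=iron, 'd'=diamond.
demos = [list("wsid"), list("wsid"), list("wwid")]
profile = build_profile(demos)
print(redistribute_reward(list("wsid"), 1.0, profile))
```

Steps whose events match the demonstration profile receive a larger share of the final reward, which is how sub-task completions become immediately rewarded instead of only at the end of the episode.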