Towards Improving Exploration in Self-Imitation Learning using Intrinsic Motivation

Reinforcement Learning has emerged as a strong alternative for solving optimization tasks efficiently. These algorithms depend heavily on the feedback signals provided by the environment, which inform the agent how good (or bad) its decisions are. Unfortunately, in a broad range of problems designing a good reward function is not trivial, so sparse reward signals are adopted instead. The lack of a dense reward function poses new challenges, mostly related to exploration. Imitation Learning addresses those problems by leveraging demonstrations from experts. In the absence of an expert (and hence of demonstrations), one option is to prioritize well-suited exploration experiences collected by the agent itself, so as to bootstrap its learning process with good exploration behaviors. However, this solution depends heavily on the agent's ability to discover such trajectories in the early stages of learning. To tackle this issue, we propose to combine imitation learning with intrinsic motivation, two of the most widely adopted techniques for problems with sparse rewards. In this work, intrinsic motivation encourages the agent to explore the environment based on its curiosity, whereas imitation learning allows it to repeat the most promising experiences to accelerate learning. This combination is shown to yield improved performance and better generalization in procedurally generated environments, outperforming previously reported self-imitation learning methods and achieving equal or better sample efficiency than intrinsic motivation in isolation.
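
The two ingredients named in the abstract can be illustrated with a minimal sketch. The abstract does not say which curiosity signal or which prioritization rule the authors use, so everything below is an assumption made for illustration: a count-based bonus stands in for intrinsic motivation, and a buffer that keeps the episodes with the highest extrinsic return stands in for self-imitation. The names `CountBonus` and `TopEpisodeBuffer` are hypothetical and do not come from the paper.

```python
import heapq
import math
from collections import defaultdict


class CountBonus:
    """Assumed curiosity signal: the bonus shrinks as a state is revisited."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.counts = defaultdict(int)

    def __call__(self, state):
        # States must be hashable (e.g. tuples or strings).
        self.counts[state] += 1
        return self.scale / math.sqrt(self.counts[state])


class TopEpisodeBuffer:
    """Assumed prioritization rule: keep the `capacity` episodes
    with the highest extrinsic return, for later imitation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (return, insertion_id, trajectory)
        self._next_id = 0

    def add(self, trajectory, extrinsic_return):
        # The unique insertion id breaks ties, so trajectories are never compared.
        item = (extrinsic_return, self._next_id, trajectory)
        self._next_id += 1
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
        elif extrinsic_return > self._heap[0][0]:
            # Evict the current worst episode in favor of the new one.
            heapq.heapreplace(self._heap, item)

    def episodes(self):
        # Best episode first.
        return [traj for _, _, traj in sorted(self._heap, reverse=True)]
```

In a training loop of this kind, the bonus would be added to the sparse environment reward at each step, while the buffer would supply trajectories for an imitation loss. How the paper actually weights and schedules the two signals is not stated in the abstract.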
