In recent years, research activity on reinforcement learning tasks formulated over environments with sparse rewards has grown remarkably. Among the numerous approaches proposed to address these hard exploration problems, intrinsic motivation mechanisms are arguably among the most studied alternatives to date. Advances reported in this area have tackled the exploration issue by proposing new algorithmic ideas for measuring novelty. However, most efforts in this direction have overlooked the influence of the different design choices and parameter settings introduced to improve the effect of the generated intrinsic bonus, and have neglected to apply those choices to other intrinsic motivation techniques that might also benefit from them. Furthermore, some of these intrinsic methods are applied with different base reinforcement learning algorithms (e.g., PPO, IMPALA) and neural network architectures, making it hard to fairly compare the reported results and the actual progress contributed by each solution. The goal of this work is to stress this crucial matter in reinforcement learning over hard exploration environments, exposing the variability and susceptibility of avant-garde intrinsic motivation techniques to diverse design factors. Ultimately, the experiments reported herein underscore the importance of carefully selecting these design aspects in accordance with the exploration requirements of the environment and the task at hand, and of evaluating all methods under the same setup so that fair comparisons can be guaranteed.