Generative Planning for Temporally Coordinated Exploration in Reinforcement Learning

Standard model-free reinforcement learning algorithms optimize a policy that generates the action to be taken at the current time step in order to maximize expected future return. While flexible, this approach suffers from inefficient exploration because of its single-step nature. In this work, we present the Generative Planning Method (GPM), which generates actions not only for the current step but also for a number of future steps (hence the term generative planning). This brings several benefits. First, since GPM is trained by maximizing value, the plans it generates can be regarded as intentional action sequences for reaching high-value regions. GPM can therefore use its generated multi-step plans for temporally coordinated exploration towards high-value regions, which is potentially more effective than a sequence of actions generated by perturbing each action at the single-step level, whose consistent movement decays exponentially with the number of exploration steps. Second, starting from a crude initial plan generator, GPM can refine it to adapt to the task, which in turn benefits future exploration. This is potentially more effective than the commonly used action-repeat strategy, whose plans are non-adaptive in form. Additionally, since the multi-step plan can be interpreted as the agent's intent from the current step over a span of time into the future, it offers a more informative and intuitive signal for interpretation. Experiments on several benchmark environments demonstrate its effectiveness compared with several baseline methods.
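The claim that the consistent movement of per-step perturbation decays exponentially can be illustrated with a toy one-dimensional example. The sketch below is not the GPM algorithm. It is a minimal illustration under stated assumptions: actions are restricted to -1 or +1, exploration noise is uniform, and all function names are hypothetical. It compares sampling a fresh direction at every step with sampling one direction and committing to it for a whole multi-step plan.

```python
import random


def per_step_displacement(n_steps, rng):
    """Net displacement when each step's direction is sampled independently."""
    return sum(rng.choice((-1, 1)) for _ in range(n_steps))


def committed_plan_displacement(n_steps, rng):
    """Net displacement when one direction is sampled once and held for the plan."""
    direction = rng.choice((-1, 1))
    return direction * n_steps


def prob_consistent(n_steps):
    """Probability that n independent +/-1 steps all share the same direction.

    There are two consistent sequences (all +1 or all -1) out of 2**n,
    so the probability is 2 * 0.5**n, which decays exponentially in n.
    """
    return 2 * 0.5 ** n_steps


def mean_abs_displacement(sampler, n_steps, trials=10_000, seed=0):
    """Monte Carlo estimate of the mean absolute displacement after n steps."""
    rng = random.Random(seed)
    total = sum(abs(sampler(n_steps, rng)) for _ in range(trials))
    return total / trials


if __name__ == "__main__":
    for n in (1, 5, 10, 20):
        print(
            n,
            prob_consistent(n),
            mean_abs_displacement(per_step_displacement, n),
            mean_abs_displacement(committed_plan_displacement, n),
        )
```

In this toy setting the committed plan always travels the full n steps, while independently perturbed steps mostly cancel out. A fixed committed direction corresponds more closely to action repeat. The abstract's point is that GPM additionally learns and adapts the shape of the plan, which this sketch does not model.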
