Adaptive and Multiple Time-scale Eligibility Traces for Online Deep Reinforcement Learning

Deep reinforcement learning (DRL) is a promising approach to teaching robots to perform complex tasks. In robotic problems with a time-varying environment, methods that directly reuse stored experience data cannot follow changes in the environment, so online DRL is required. Eligibility traces are a well-known online learning technique for improving sample efficiency, but they have been used in traditional reinforcement learning with linear regressors rather than in DRL. The reason is that the dependency between the parameters of a deep neural network would destroy the eligibility traces. Replacing the gradient with the most influential one, instead of accumulating gradients as the eligibility traces, can alleviate this problem, but the replacing operation reduces the number of times previous experiences are reused. To address these issues, this study proposes a new eligibility traces method that can be used even in DRL while maintaining high sample efficiency. When the accumulated gradients differ from those computed using the latest parameters, the proposed method adaptively decays the eligibility traces according to the divergence between the past and latest parameters. Because computing the divergence between the past and latest parameters directly is computationally infeasible, Bregman divergences between the outputs computed with the past and latest parameters are used instead. In addition, a generalized method with multiple time-scale traces is designed for the first time. This design allows for the replacement of the most influential adaptively accumulated (decayed) eligibility traces.
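To make the idea of adaptively decayed traces concrete, the following is a minimal sketch, not the method proposed in the paper. It runs online TD(λ) with accumulating traces on a small nonlinear value function `tanh(w @ x)`, and shrinks the trace decay when the output for the last visited state changes after a parameter update. The half squared error is used as one simple instance of a Bregman divergence. The exponential rule in `adaptive_decay`, the `sensitivity` parameter, and all function names are assumptions made for illustration; the abstract does not specify the actual decay rule, and the multiple time-scale extension is not shown.

```python
import numpy as np


def value(w, x):
    """Toy nonlinear value function (an assumption, not the paper's network)."""
    return np.tanh(w @ x)


def value_grad(w, x):
    """Gradient of tanh(w @ x) with respect to w; it depends on w, so old gradients go stale."""
    return (1.0 - np.tanh(w @ x) ** 2) * x


def squared_error_divergence(y_past, y_latest):
    """Half squared Euclidean distance, the Bregman divergence generated by 0.5 * ||y||^2."""
    d = np.asarray(y_past, dtype=float) - np.asarray(y_latest, dtype=float)
    return 0.5 * float(np.dot(d.ravel(), d.ravel()))


def adaptive_decay(base_decay, divergence, sensitivity=1.0):
    """Hypothetical rule: the larger the output divergence, the faster the traces decay."""
    return base_decay * np.exp(-sensitivity * divergence)


def td_lambda_step(w, e, x, r, x_next, decay, alpha=0.1, gamma=0.9):
    """One online TD(lambda) update with accumulating traces and a given trace decay."""
    delta = r + gamma * value(w, x_next) - value(w, x)  # TD error
    e = decay * e + value_grad(w, x)                    # accumulate the newest gradient
    w = w + alpha * delta * e                           # update along the trace
    return w, e, delta


def run_online(xs, rs, alpha=0.1, gamma=0.9, lam=0.8, sensitivity=1.0):
    """Process one trajectory; xs has one more row than rs (states s_0 .. s_T)."""
    w = np.zeros(xs.shape[1])
    e = np.zeros_like(w)
    decay = gamma * lam
    for t in range(len(rs)):
        x, x_next = xs[t], xs[t + 1]
        v_past = value(w, x)      # output with the parameters before the update
        w, e, _ = td_lambda_step(w, e, x, rs[t], x_next, decay, alpha, gamma)
        v_latest = value(w, x)    # output for the same input with the latest parameters
        div = squared_error_divergence(v_past, v_latest)
        decay = adaptive_decay(gamma * lam, div, sensitivity)
    return w, e
```

With `sensitivity=0.0` the sketch reduces to ordinary TD(λ) with accumulating traces, which gives a baseline for comparing the effect of the adaptive decay.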
