Training agents via off-policy deep reinforcement learning (RL) requires a large memory, called the replay memory, that stores past experiences used for learning. These experiences are sampled, uniformly or non-uniformly, to create the batches used for training. When computing the loss function, off-policy algorithms assume that all samples are equally important. In this paper, we hypothesize that training can be enhanced by assigning each experience a different importance, based on its temporal-difference (TD) error, directly in the training objective. We propose a novel method that introduces a weighting factor for each experience when calculating the loss function at the learning stage. In addition to improving convergence speed when used with uniform sampling, the method can be combined with prioritization methods for non-uniform sampling. Combining the proposed method with prioritization methods improves sampling efficiency while increasing the performance of TD-based off-policy RL algorithms. The effectiveness of the proposed method is demonstrated by experiments in six environments of the OpenAI Gym suite. The experimental results show that the proposed method achieves a 33%–76% reduction in convergence time in three environments, and an 11% increase in returns and a 3%–10% increase in success rate in the other three environments.
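To make the idea of weighting each experience by its TD error in the training objective concrete, the following is a minimal sketch of how such a weighted loss could look in a DQN-style update. The abstract does not specify the paper's exact weighting formula, so the function name, the exponent `alpha`, the normalization of the weights, and the use of a standard DQN target are all illustrative assumptions rather than the proposed method itself.

```python
import torch

def weighted_td_loss(q_net, target_net, batch, gamma=0.99, alpha=0.5):
    """Illustrative per-sample TD-error-based loss weighting.

    `alpha` (assumed here) controls how strongly the TD-error magnitude
    scales each sample's contribution to the loss.
    """
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions actually taken.
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets (plain DQN target; Double DQN would also work).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    td_error = target - q

    # Hypothetical weighting: larger |TD error| -> larger weight,
    # normalized so the weights average to 1 over the batch.
    with torch.no_grad():
        w = td_error.abs().pow(alpha) + 1e-6
        w = w / w.mean()

    # Weighted squared TD error replaces the uniform-importance MSE loss.
    return (w * td_error.pow(2)).mean()
```

When combined with a non-uniform sampling scheme such as prioritized experience replay, the per-sample weight above would simply multiply the importance-sampling correction already applied to each transition; this combination is a sketch of one plausible integration, not the paper's prescribed procedure.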