Training agents via off-policy deep reinforcement learning (RL) requires a large memory, called the replay memory, that stores past experiences used for learning. These experiences are sampled, uniformly or non-uniformly, to create the batches used for training. When computing the loss function, off-policy algorithms assume that all samples are equally important. In this paper, we hypothesize that training can be enhanced by assigning each experience a different importance, based on its temporal-difference (TD) error, directly in the training objective. We propose a novel method that introduces a weighting factor for each experience when calculating the loss function at the learning stage. In addition to improving convergence speed when used with uniform sampling, the method can be combined with prioritization methods for non-uniform sampling. Combining the proposed method with prioritization methods improves sampling efficiency while increasing the performance of TD-based off-policy RL algorithms. The effectiveness of the proposed method is demonstrated by experiments in six environments of the OpenAI Gym suite. The experimental results show that the proposed method achieves a 33%–76% reduction in convergence time in three environments, and an 11% increase in returns and a 3%–10% increase in success rate in the other three environments.
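To make the idea of weighting each experience by its TD error in the training objective concrete, the following is a minimal sketch of how such a weighted loss could look in a DQN-style update. The abstract does not specify the paper's exact weighting formula, so the function name, the exponent `alpha`, the normalization of the weights, and the use of a standard DQN target are all illustrative assumptions rather than the proposed method itself.

```python
import torch

def weighted_td_loss(q_net, target_net, batch, gamma=0.99, alpha=0.5):
    """Illustrative per-sample TD-error-based loss weighting.

    `alpha` (assumed here) controls how strongly the TD-error magnitude
    scales each sample's contribution to the loss.
    """
    states, actions, rewards, next_states, dones = batch

    # Q-values of the actions actually taken.
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped targets (plain DQN target; Double DQN would also work).
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * next_q

    td_error = target - q

    # Hypothetical weighting: larger |TD error| -> larger weight,
    # normalized so the weights average to 1 over the batch.
    with torch.no_grad():
        w = td_error.abs().pow(alpha) + 1e-6
        w = w / w.mean()

    # Weighted squared TD error replaces the uniform-importance MSE loss.
    return (w * td_error.pow(2)).mean()
```

When combined with a non-uniform sampling scheme such as prioritized experience replay, the per-sample weight above would simply multiply the importance-sampling correction already applied to each transition; this combination is a sketch of one plausible integration, not the paper's prescribed procedure.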