Policy-gradient methods in reinforcement learning (RL) are highly general and widely applied in practice, but their performance suffers from the high variance of the gradient estimate. Several procedures have been proposed to reduce this variance, including the actor-critic (AC) and advantage actor-critic (A2C) methods. Recently, these approaches have gained a new perspective with the advent of deep RL: both new control variates (CV) and new sub-sampling procedures have become available in the setting of complex models such as neural networks. A vital component of CV-based methods is the objective functional used to train the CV; the most popular choice is the least-squares criterion of A2C. Despite its practical success, this criterion is not the only one possible. In this paper we investigate, for the first time, the performance of an alternative criterion called the empirical variance (EV). Our experiments show that the EV criterion performs no worse than that of A2C and can sometimes be considerably better. Beyond that, we prove theoretical guarantees of actual variance reduction under very general assumptions and show that the A2C least-squares objective is an upper bound on the EV objective. Our experiments indicate that, in terms of variance reduction, EV-based methods substantially outperform A2C.
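The claimed relation between the two criteria follows from the elementary identity Var[X] = E[X²] − (E[X])² ≤ E[X²]. A minimal numerical sketch of this bound, using hypothetical return samples `G` and baseline predictions `b` (the names and data are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-sample returns G_i and baseline predictions b_i.
G = rng.normal(loc=1.0, scale=2.0, size=1000)
b = 0.8 * G + rng.normal(scale=0.5, size=1000)  # an imperfect baseline

residual = G - b

# A2C-style least-squares criterion: E[(G - b)^2]
ls_loss = np.mean(residual ** 2)

# Empirical-variance (EV) criterion: Var[G - b] = E[(G-b)^2] - (E[G-b])^2
ev_loss = np.var(residual)

# The least-squares objective upper-bounds the EV objective.
assert ev_loss <= ls_loss
```

The gap between the two losses equals the squared mean of the residual, so the criteria coincide exactly when the baseline is unbiased.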