The control variates (CV) method is widely used in policy gradient estimation to reduce the variance of gradient estimators in practice. A control variate is typically applied by subtracting a baseline function from the state-action value estimates, and the resulting variance-reduced policy gradient is expected to improve learning efficiency. Recent research on control variates for deep neural net policies mainly focuses on scalar-valued baseline functions, while vector-valued baselines remain under-explored. This paper investigates variance reduction with coordinate-wise and layer-wise control variates constructed from vector-valued baselines for neural net policies. We present experimental evidence suggesting that such baselines can achieve lower variance than the conventional scalar-valued baseline. We demonstrate how to equip the popular Proximal Policy Optimization (PPO) algorithm with these new control variates. We show that, with proper regularization, the resulting algorithm achieves higher sample efficiency than scalar control variates on continuous control benchmarks.
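As a rough illustration of the estimators the abstract refers to (a minimal sketch in our own notation; the symbols b, b_i, and the layer grouping l(i) are assumptions, not necessarily the paper's definitions), the scalar-baseline likelihood-ratio gradient and its coordinate-wise and layer-wise generalizations can be written as:

```latex
% Scalar baseline: a single control variate b(s) shared by all gradient coordinates.
\nabla_\theta J(\theta) \approx \nabla_\theta \log \pi_\theta(a \mid s)\,\bigl(\hat{Q}(s,a) - b(s)\bigr)

% Coordinate-wise (vector-valued) baseline: each parameter coordinate i subtracts
% its own baseline b_i(s); the estimator stays unbiased per coordinate because
% b_i(s) does not depend on the action a.
g_i = \partial_{\theta_i} \log \pi_\theta(a \mid s)\,\bigl(\hat{Q}(s,a) - b_i(s)\bigr)

% Layer-wise baseline: coordinates are grouped by network layer l(i), and all
% coordinates in the same layer share one baseline b_{l(i)}(s).
g_i = \partial_{\theta_i} \log \pi_\theta(a \mid s)\,\bigl(\hat{Q}(s,a) - b_{l(i)}(s)\bigr)
```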