Policy gradient methods can solve complex tasks but often fail when the dimensionality of the action space or the objective multiplicity grows very large. This occurs, in part, because the variance of score-based gradient estimators scales quadratically with the number of targets. In this paper, we propose a causal baseline which exploits the independence structure encoded in a novel action-target influence network. Causal policy gradients (CPGs), which follow, provide a common framework for analysing key state-of-the-art algorithms, are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes. We provide an analysis of the proposed estimator and identify the conditions under which variance is guaranteed to improve. The algorithmic aspects of CPGs are also discussed, including optimal policy factorisations, their complexity, and the use of conditioning to scale efficiently to extremely large, concurrent tasks. The performance advantages of two variants of the algorithm are demonstrated on large-scale bandit and concurrent inventory-management problems.
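The variance behaviour described above, and the effect of pairing each score term only with the rewards it can influence, can be illustrated with a small score-function (REINFORCE-style) estimator. The sketch below is not the paper's algorithm: the factorised Gaussian policy, the diagonal influence mask `W` (a stand-in for the action-target influence network), and the per-target quadratic rewards are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: D action dimensions, M targets.
# W[i, j] = 1 if action dimension i influences target j; here we assume
# each dimension drives exactly one target (identity mask, D == M).
D, M = 4, 4
W = np.eye(D, M)

def sample_actions(mu, n):
    """Draw n actions from a factorised Gaussian policy N(mu, I)."""
    return mu + rng.standard_normal((n, D))

def score(a, mu):
    """Per-dimension score: d/d mu_i of log N(a | mu, I) is (a_i - mu_i)."""
    return a - mu

def target_rewards(a):
    """Toy per-target rewards; target j depends only on action dim j."""
    return -(a - 1.0) ** 2  # shape (n, M), maximised at a = 1

def vanilla_pg(mu, n=1_000):
    """Score-function gradient paired with the total reward (no structure)."""
    a = sample_actions(mu, n)
    R = target_rewards(a).sum(axis=1, keepdims=True)
    return (score(a, mu) * R).mean(axis=0)

def structured_pg(mu, n=1_000):
    """Structured estimator: each score term sees only the rewards of
    targets it can influence (a factored-baseline sketch, not CPG itself)."""
    a = sample_actions(mu, n)
    r = target_rewards(a)          # (n, M) per-target rewards
    return (score(a, mu) * (r @ W.T)).mean(axis=0)

mu = np.zeros(D)
g_v = np.stack([vanilla_pg(mu) for _ in range(200)])
g_s = np.stack([structured_pg(mu) for _ in range(200)])
print("vanilla estimator variance   :", g_v.var(axis=0).mean())
print("structured estimator variance:", g_s.var(axis=0).mean())
```

With the identity mask, the zero-mean cross terms that couple each score dimension to every other target's reward drop out of the structured estimator, so its empirical variance is markedly lower; those cross terms are exactly what drives the quadratic scaling in the unstructured case.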