Policy gradient methods can solve complex tasks but often fail when the dimensionality of the action space or the number of objectives grows very large. This occurs, in part, because the variance of score-based gradient estimators scales quadratically. In this paper, we address this problem through a causal baseline which exploits the independence structure encoded in a novel action-target influence network. The resulting causal policy gradients (CPGs) provide a common framework for analysing key state-of-the-art algorithms, are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes. We provide an analysis of the proposed estimator and identify the conditions under which variance is reduced. The algorithmic aspects of CPGs are discussed, including optimal policy factorisation, as characterised by minimum biclique coverings, and the implications of incorrectly specifying the network for the bias-variance trade-off. Finally, we demonstrate the performance advantages of our algorithm on large-scale bandit and traffic intersection problems, providing a novel contribution to the latter in the form of a spatio-causal approximation.
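The causal-baseline idea can be illustrated on a toy bandit. When the reward decomposes additively across action dimensions and each dimension influences only its own term, a score-function gradient estimator may multiply each score by just the reward term that dimension affects, rather than the total reward; the estimator remains unbiased (the dropped terms are independent of the score, so their cross-expectation vanishes) while its variance falls sharply. The following minimal NumPy sketch is not taken from the paper: the Gaussian policy, the quadratic per-dimension rewards, and the dimensionality are all chosen purely for illustration.

```python
import numpy as np

# Toy D-dimensional Gaussian-policy bandit with an additively decomposable
# reward R = sum_d r_d(a_d). Each action dimension influences only its own
# reward term, which is the independence structure a causal baseline exploits.
rng = np.random.default_rng(0)
D = 8                       # action dimensionality (illustrative)
mu = np.zeros(D)            # policy mean; the policy std is fixed at 1

def grad_samples(n, causal):
    a = rng.normal(mu, 1.0, size=(n, D))
    score = a - mu                    # d/dmu of log N(a; mu, 1)
    r = -(a - 1.0) ** 2               # per-dimension reward terms r_d
    if causal:
        # each dimension is paired only with the reward term it influences
        return score * r
    # vanilla score-function estimator: every dimension sees the total reward
    return score * r.sum(axis=1, keepdims=True)

g_vanilla = grad_samples(200_000, causal=False)
g_causal = grad_samples(200_000, causal=True)

# Both estimators agree in expectation (the true gradient here is 2 in every
# coordinate), but the causal one is far less noisy.
print(g_vanilla.mean(axis=0).round(1))
print(g_causal.mean(axis=0).round(1))
print(g_vanilla.var(axis=0).mean() / g_causal.var(axis=0).mean())
```

Note how the variance gap widens with `D`: in the vanilla estimator every coordinate's noise picks up contributions from all `D` reward terms, which is one concrete face of the quadratic variance scaling the abstract refers to.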