Policy gradient methods can solve complex tasks but often fail when the dimensionality of the action space or the objective multiplicity grows very large. This occurs, in part, because the variance of score-based gradient estimators scales quadratically in these quantities. In this paper, we address this problem through a factor baseline which exploits independence structure encoded in a novel action-target influence network. The resulting factored policy gradients (FPGs) provide a common framework for analysing key state-of-the-art algorithms, are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes. We provide an analysis of the proposed estimator and identify the conditions under which variance is reduced. We also discuss the algorithmic aspects of FPGs, including optimal policy factorisation, as characterised by minimum biclique coverings, and the implications for the bias-variance trade-off of incorrectly specifying the network. Finally, we demonstrate the performance advantages of our algorithm on large-scale bandit and traffic intersection problems, providing a novel contribution to the latter in the form of a spatial approximation.
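To make the variance argument concrete, the following is a minimal, self-contained sketch (not the paper's implementation) of a score-function gradient estimator for a factorised Bernoulli policy, in which each action dimension's gradient term is paired only with the reward targets it influences according to an assumed influence matrix. All names here (`W`, `sample_actions`, `rewards`, `factored_grad`) and the toy block-diagonal influence structure are illustrative assumptions, not constructs taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: D binary action dimensions, K reward targets.
# W[d, k] = 1 if action dimension d influences target k (assumed known here).
D, K = 6, 3
W = np.zeros((D, K))
W[:2, 0] = W[2:4, 1] = W[4:, 2] = 1.0  # block-diagonal influence structure

def sample_actions(theta):
    """Factorised Bernoulli policy: a_d ~ Bern(sigmoid(theta_d))."""
    p = 1.0 / (1.0 + np.exp(-theta))
    a = (rng.random(D) < p).astype(float)
    return a, p

def rewards(a):
    """Each target k depends only on the action dimensions that influence it."""
    return np.array([a[W[:, k] == 1].sum() + 0.1 * rng.standard_normal()
                     for k in range(K)])

def score(a, p):
    """Per-dimension score function: d/dtheta_d log pi(a_d)."""
    return a - p

def vanilla_grad(theta):
    """Standard score-function estimator: every dimension sees the total return."""
    a, p = sample_actions(theta)
    return score(a, p) * rewards(a).sum()

def factored_grad(theta):
    """Factored estimator: dimension d sees only the targets it influences (row W[d])."""
    a, p = sample_actions(theta)
    return score(a, p) * (W @ rewards(a))

theta = np.zeros(D)
v = np.stack([vanilla_grad(theta) for _ in range(5000)])
f = np.stack([factored_grad(theta) for _ in range(5000)])
print("mean (should roughly agree):", v.mean(0).round(2), f.mean(0).round(2))
print("variance per dimension:     ", v.var(0).round(2), f.var(0).round(2))
```

On this toy problem both estimators agree in expectation, while the factored variant typically exhibits lower per-dimension variance, which is the effect the abstract attributes to exploiting independence structure via the factor baseline.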