Approximate Policy Iteration (API) algorithms alternate between (approximate) policy evaluation and (approximate) greedification. Many different approaches have been explored for approximate policy evaluation, but less is understood about approximate greedification and what choices guarantee policy improvement. In this work, we investigate approximate greedification when reducing the KL divergence between the parameterized policy and the Boltzmann distribution over action values. In particular, we investigate the difference between the forward and reverse KL divergences, with varying degrees of entropy regularization. We show that the reverse KL has stronger policy improvement guarantees, but that reducing the forward KL can result in a worse policy. We also demonstrate, however, that a large enough reduction of the forward KL can induce improvement under additional assumptions. Empirically, we show on simple continuous-action environments that the forward KL can induce more exploration, but at the cost of a more suboptimal policy. No significant differences were observed in the discrete-action setting or on a suite of benchmark problems. Throughout, we highlight that many policy gradient methods can be seen as instances of API, with either the forward or reverse KL for the policy update, and discuss next steps for understanding and improving our policy optimization algorithms.
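For concreteness, the two greedification objectives contrasted above can be written explicitly. The notation below (a parameterized policy $\pi_\theta$, action values $Q$, a temperature $\tau$, and the Boltzmann target $\mathcal{B}_\tau Q$) is our shorthand for the quantities named in the abstract, not necessarily the paper's own symbols:

\[
\mathcal{B}_\tau Q(a \mid s) \;=\; \frac{\exp\!\big(Q(s,a)/\tau\big)}{\int_{\mathcal{A}} \exp\!\big(Q(s,b)/\tau\big)\, db},
\]
\[
\text{reverse KL:}\;\; \mathrm{KL}\!\big(\pi_\theta(\cdot \mid s)\,\big\|\,\mathcal{B}_\tau Q(\cdot \mid s)\big),
\qquad
\text{forward KL:}\;\; \mathrm{KL}\!\big(\mathcal{B}_\tau Q(\cdot \mid s)\,\big\|\,\pi_\theta(\cdot \mid s)\big),
\]

where the temperature $\tau$ controls the degree of entropy regularization and the greedification step reduces one of these divergences at the states of interest.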