Trust Region Policy Optimization (TRPO) is an iterative method that simultaneously maximizes a surrogate objective and enforces a trust region constraint over consecutive policies at each iteration. The combination of surrogate objective maximization and trust region enforcement has been shown to be crucial for guaranteeing monotonic policy improvement. However, solving a trust-region-constrained optimization problem can be computationally intensive, as it requires many conjugate gradient steps and a large number of on-policy samples. In this paper, we show that the trust region constraint over policies can be safely replaced by a trust-region-free constraint without compromising the underlying monotonic improvement guarantee. The key idea is to generalize the surrogate objective used in TRPO so that a monotonic improvement guarantee still emerges from constraining the maximum advantage-weighted ratio between policies. This new constraint outlines a conservative mechanism for iterative policy optimization and sheds light on practical ways to optimize the generalized surrogate objective. We show that the new constraint can be enforced effectively in practice by optimizing the generalized objective conservatively. We call the resulting algorithm Trust-REgion-Free Policy Optimization (TREFree), as it is free of any explicit trust region constraint. Empirical results show that TREFree outperforms TRPO and Proximal Policy Optimization (PPO) in terms of policy performance and sample efficiency.
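For reference, the per-iteration trust-region-constrained problem that TRPO solves takes the standard form below. This is a recap of the well-known TRPO formulation rather than the generalized surrogate objective or the advantage-weighted ratio constraint proposed in this paper, which are defined in the main text; the symbols $A$ and $\delta$ follow the usual TRPO notation and are not specific to this work.

\[
\max_{\theta}\;\; \mathbb{E}_{s,a \sim \pi_{\theta_{\mathrm{old}}}}\!\left[\frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\mathrm{old}}}(a \mid s)}\, A^{\pi_{\theta_{\mathrm{old}}}}(s,a)\right]
\quad \text{subject to} \quad
\mathbb{E}_{s \sim \pi_{\theta_{\mathrm{old}}}}\!\left[ D_{\mathrm{KL}}\!\left(\pi_{\theta_{\mathrm{old}}}(\cdot \mid s)\,\big\|\,\pi_{\theta}(\cdot \mid s)\right)\right] \le \delta,
\]

where $A^{\pi_{\theta_{\mathrm{old}}}}$ is the advantage function of the current policy and $\delta$ is the trust region radius. It is the KL constraint on the right that this paper argues can be dropped in favor of a constraint on the maximum advantage-weighted ratio between consecutive policies.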