The policy improvement bound on the difference of the discounted returns plays a crucial role in the theoretical justification of the trust-region policy optimization (TRPO) algorithm. The existing bound degenerates as the discount factor approaches one, which calls into question the applicability of TRPO and related algorithms when the discount factor is close to one. We refine the results of \cite{Schulman2015, Achiam2017} and propose a novel bound that is ``continuous'' in the discount factor. In particular, our bound also applies to MDPs under the long-run average reward criterion.
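For concreteness, the bound of \cite{Schulman2015} has, up to the exact constants stated there, the form
\begin{align*}
  \eta(\tilde{\pi}) \;\ge\; L_{\pi}(\tilde{\pi})
    \;-\; \frac{4\epsilon\gamma}{(1-\gamma)^{2}}\,\alpha^{2},
  \qquad
  \alpha = D_{\mathrm{TV}}^{\max}(\pi,\tilde{\pi}), \quad
  \epsilon = \max_{s,a}\bigl|A_{\pi}(s,a)\bigr|,
\end{align*}
where the penalty coefficient $4\epsilon\gamma/(1-\gamma)^{2}$ diverges as $\gamma \to 1$; this divergence is the degeneracy referred to above.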