Policy optimization is a fundamental principle for designing reinforcement learning algorithms, and one example is the proximal policy optimization algorithm with a clipped surrogate objective (PPO-Clip), which has been widely used in deep reinforcement learning due to its simplicity and effectiveness. Despite its superior empirical performance, PPO-Clip has lacked a theoretical justification to date. In this paper, we establish the first global convergence rate of PPO-Clip under neural function approximation. We identify the fundamental challenges of analyzing PPO-Clip and address them with two core ideas: (i) We reinterpret PPO-Clip from the perspective of hinge loss, which connects policy improvement with solving a large-margin classification problem with hinge loss and yields a generalized version of the PPO-Clip objective. (ii) Based on this viewpoint, we propose a two-step policy improvement scheme that facilitates the convergence analysis by decoupling policy search from the complex neural policy parameterization, with the help of entropic mirror descent and a regression-based policy update. Moreover, our theoretical results provide the first characterization of the effect of the clipping mechanism on the convergence of PPO-Clip. Through experiments, we empirically validate the hinge-loss reinterpretation of PPO-Clip and the generalized objective with various classifiers on standard RL benchmark tasks.
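To make the hinge-loss reinterpretation concrete, the following is a minimal sketch of the kind of identity behind it, stated for the standard clipped surrogate; here $\rho_\theta = \pi_\theta(a\,|\,s)/\pi_{\theta_{\mathrm{old}}}(a\,|\,s)$ denotes the probability ratio, $A$ an advantage estimate, and $\epsilon$ the clipping parameter (the generalized objective analyzed in the paper may differ in its exact form):
\[
\min\bigl(\rho_\theta A,\ \mathrm{clip}(\rho_\theta,\,1-\epsilon,\,1+\epsilon)\,A\bigr)
\;=\; c(A,\epsilon)\;-\;|A|\,\max\bigl(0,\ \epsilon - \operatorname{sign}(A)\,(\rho_\theta - 1)\bigr),
\]
where $c(A,\epsilon) = (1+\epsilon)A$ if $A \ge 0$ and $(1-\epsilon)A$ otherwise, which is independent of $\theta$. Maximizing the clipped surrogate is thus equivalent to minimizing the weighted hinge loss $\mathbb{E}\bigl[\,|A|\,\max\bigl(0,\ \epsilon - \operatorname{sign}(A)(\rho_\theta - 1)\bigr)\bigr]$, i.e., a large-margin classification of $\operatorname{sign}(A)$ with margin $\epsilon$ on $\rho_\theta - 1$.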