While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an ``optimistic version'' of the policy gradient direction. This paper proves that, for episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^2 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
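For concreteness, the display below gives one standard way to write the regret that this bound controls; it is a sketch under the adversarial full-information setting described above, and the episode count $K$, value functions $V_1^{\pi,k}$, initial states $x_1^k$, and executed policies $\pi^k$ are notation introduced here for illustration rather than taken from the abstract:
\[
% Sketch: regret over K = T/H episodes, measured against the best fixed policy in hindsight;
% V_1^{\pi,k}(x_1^k) denotes the value of policy \pi under the k-th (adversarially chosen) reward,
% evaluated at the initial state x_1^k of episode k.
\mathrm{Regret}(T) \;=\; \max_{\pi} \sum_{k=1}^{K} \Bigl( V_1^{\pi,k}\bigl(x_1^k\bigr) - V_1^{\pi^k,k}\bigl(x_1^k\bigr) \Bigr),
\qquad K = T/H.
\]
Under this reading, the stated guarantee says the regret of OPPO grows as $\tilde{O}(\sqrt{d^2 H^3 T})$, which is sublinear in $T$, so the average per-episode suboptimality vanishes as $T \to \infty$.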