While policy-based reinforcement learning (RL) has achieved tremendous success in practice, it is significantly less understood in theory, especially compared with value-based RL. In particular, it remains elusive how to design a provably efficient policy optimization algorithm that incorporates exploration. To bridge this gap, this paper proposes an Optimistic variant of the Proximal Policy Optimization algorithm (OPPO), which follows an "optimistic version" of the policy gradient direction. This paper proves that, in the problem of episodic Markov decision processes with linear function approximation, unknown transitions, and adversarial rewards with full-information feedback, OPPO achieves $\tilde{O}(\sqrt{d^3 H^3 T})$ regret. Here $d$ is the feature dimension, $H$ is the episode horizon, and $T$ is the total number of steps. To the best of our knowledge, OPPO is the first provably efficient policy optimization algorithm that explores.
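As a hedged illustration of what an "optimistic version" of the policy gradient direction can look like (the step size $\alpha > 0$, the estimated action-value function $\widehat{Q}^{k}$, and the exploration bonus $\Gamma^{k}$ below are illustrative notation introduced here, not taken from the abstract), one natural mirror-descent-style sketch updates the policy at episode $k$ via
$$\pi^{k+1}(\cdot \mid s) \;\propto\; \pi^{k}(\cdot \mid s)\,\exp\!\Big(\alpha\big(\widehat{Q}^{k}(s,\cdot) + \Gamma^{k}(s,\cdot)\big)\Big),$$
so that the multiplicative update follows the estimated action values inflated by an uncertainty bonus; in such a scheme the bonus term is what encourages exploration of insufficiently visited state-action pairs.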