Policy optimization methods are among the most widely used classes of Reinforcement Learning (RL) algorithms. However, the theoretical understanding of these methods remains insufficient. Even in the episodic (time-inhomogeneous) tabular setting, the state-of-the-art regret bound for policy-based methods in \citet{shani2020optimistic} is only $\tilde{O}(\sqrt{S^2AH^4K})$, where $S$ is the number of states, $A$ is the number of actions, $H$ is the horizon, and $K$ is the number of episodes; this leaves a $\sqrt{SH}$ gap to the information-theoretic lower bound $\tilde{\Omega}(\sqrt{SAH^3K})$. To bridge this gap, we propose a novel algorithm, Reference-based Policy Optimization with Stable at Any Time guarantee (\algnameacro), which features the property ``Stable at Any Time''. We prove that our algorithm achieves $\tilde{O}(\sqrt{SAH^3K} + \sqrt{AH^4K})$ regret. When $S > H$, our algorithm is minimax optimal up to logarithmic factors. To the best of our knowledge, RPO-SAT is the first computationally efficient, nearly minimax optimal policy-based algorithm for tabular RL.
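Concretely, the $\sqrt{SH}$ gap is simply the ratio of the two bounds (ignoring logarithmic factors):
\[
\frac{\sqrt{S^2AH^4K}}{\sqrt{SAH^3K}} = \sqrt{SH}.
\]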