In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
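The sketch below is an illustrative (not the paper's) comparison of the two implementation choices the abstract contrasts: an "exact" Politex-style update that stores the Q-estimate of every past policy and acts on their average, versus the proposed replay-based variant that fits a single Q-function to a buffer pooling data from all past policies. It assumes linear function approximation, placeholder regression targets in place of a real policy-evaluation step, and hypothetical helper names (`featurize`, `fit_q`, `eta`).

```python
import numpy as np

# Minimal sketch (not the paper's exact implementation): Politex-style updates
# with linear Q-functions. The "exact" variant keeps every past Q estimate and
# averages them; the "replay" variant fits a single Q-function to a buffer that
# pools transitions from all past policies, approximating the same average.

rng = np.random.default_rng(0)
d, n_actions, eta = 8, 4, 1.0          # feature dim, action count, softmax temperature

def featurize(s, a):
    """Hypothetical state-action features phi(s, a) in R^d."""
    x = np.zeros(d)
    x[: d // 2] = s
    x[d // 2 + a % (d // 2)] = 1.0
    return x

def softmax_policy(q_values):
    """Softmax (Boltzmann) policy over the estimated action values."""
    z = eta * (q_values - q_values.max())
    p = np.exp(z)
    return p / p.sum()

def fit_q(features, targets, reg=1e-3):
    """Least-squares (ridge) fit of a linear Q-function."""
    A = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(A, features.T @ targets)

past_weights = []                       # exact variant: one weight vector per phase
replay_features, replay_targets = [], []  # replay variant: one shared buffer

for phase in range(5):
    # Collect simulated data under the current policy; targets here are
    # placeholder returns rather than a real policy-evaluation step.
    for _ in range(200):
        s = rng.normal(size=d // 2)
        a = rng.integers(n_actions)
        target = rng.normal()
        replay_features.append(featurize(s, a))
        replay_targets.append(target)

    X = np.array(replay_features)
    y = np.array(replay_targets)

    # Exact: fit Q on this phase's data only, remember the weights, average them.
    w_phase = fit_q(X[-200:], y[-200:])
    past_weights.append(w_phase)
    w_avg = np.mean(past_weights, axis=0)

    # Replay: fit a single Q-function on all pooled data from past phases.
    w_replay = fit_q(X, y)

    # Both variants act via a softmax over the (approximately) averaged Q.
    s = rng.normal(size=d // 2)
    q_avg = np.array([featurize(s, a) @ w_avg for a in range(n_actions)])
    q_rep = np.array([featurize(s, a) @ w_replay for a in range(n_actions)])
    print(phase, softmax_policy(q_avg).round(3), softmax_policy(q_rep).round(3))
```

With linear features and least-squares fits the two variants coincide up to how the data are weighted; the memory and compute savings highlighted in the abstract arise in the neural-network setting, where the replay variant trains one network instead of storing and evaluating a growing collection of past Q-networks.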