We develop several new algorithms for learning Markov Decision Processes in an infinite-horizon average-reward setting with linear function approximation. Using the optimism principle and assuming that the MDP has a linear structure, we first propose a computationally inefficient algorithm with optimal $\widetilde{O}(\sqrt{T})$ regret and another computationally efficient variant with $\widetilde{O}(T^{3/4})$ regret, where $T$ is the number of interactions. Next, taking inspiration from adversarial linear bandits, we develop yet another efficient algorithm with $\widetilde{O}(\sqrt{T})$ regret under a different set of assumptions, improving the best existing result by Hao et al. (2020) with $\widetilde{O}(T^{2/3})$ regret. Moreover, we draw a connection between this algorithm and the Natural Policy Gradient algorithm proposed by Kakade (2002), and show that our analysis improves the sample complexity bound recently given by Agarwal et al. (2020).
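For concreteness, the regret notion implicit in these bounds is the standard one for infinite-horizon average-reward learning; the following is a minimal sketch of that definition, where the symbols $J^{*}$, $s_t$, $a_t$, and $r(\cdot,\cdot)$ are our own notation and not taken from the abstract itself:
\[
% Assumed standard definition: J^* is the optimal long-term average reward of the MDP,
% and r(s_t, a_t) is the reward collected at step t along the learner's trajectory.
R_T \;=\; T \cdot J^{*} \;-\; \sum_{t=1}^{T} r(s_t, a_t),
\]
so that an $\widetilde{O}(\sqrt{T})$ bound on $R_T$ means the learner's average reward approaches $J^{*}$ at rate $\widetilde{O}(1/\sqrt{T})$.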