Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains. However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys an $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation. Our algorithm and analysis rely on online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.
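To make the online-learning viewpoint concrete, the sketch below illustrates one plausible instantiation of the ideas named in the abstract: a softmax (exponential-weights) policy update in which estimated action values play the role of losses, the learning rate adapts to the data via the squared deviation between predicted and realized losses, and an optimistic prediction of the upcoming loss (here, simply the most recent loss) is folded into the update. This is a minimal illustration, not the paper's AAPI algorithm itself; the tabular representation and all names (`num_states`, `q_estimates`, etc.) are assumptions introduced for exposition.

```python
import numpy as np


class OptimisticSoftmaxPolicy:
    """Illustrative optimistic exponential-weights policy update with a
    data-dependent adaptive learning rate (tabular sketch, not AAPI itself)."""

    def __init__(self, num_states: int, num_actions: int):
        self.num_states = num_states
        self.num_actions = num_actions
        # Cumulative losses; value-function estimates play the role of losses.
        self.cum_losses = np.zeros((num_states, num_actions))
        # Most recently observed loss, reused as the optimistic prediction.
        self.prev_loss = np.zeros((num_states, num_actions))
        # Sum of squared deviations between predicted and realized losses,
        # which drives the adaptive learning rate.
        self.dev_sq_sum = 1e-8

    def learning_rate(self) -> float:
        # Data-dependent step size: stays large while loss predictions are
        # accurate, shrinks when they are not.
        return 1.0 / np.sqrt(self.dev_sq_sum)

    def policy(self) -> np.ndarray:
        # Optimistic step: include a prediction of the upcoming loss
        # before forming the softmax over accumulated losses.
        eta = self.learning_rate()
        logits = -eta * (self.cum_losses + self.prev_loss)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        return probs / probs.sum(axis=1, keepdims=True)

    def update(self, q_estimates: np.ndarray) -> None:
        # Treat the new Q-value estimates as the realized loss for this phase.
        self.dev_sq_sum += np.sum((q_estimates - self.prev_loss) ** 2)
        self.cum_losses += q_estimates
        self.prev_loss = q_estimates.copy()
```

A typical usage loop would alternate calling `policy()` to act for a phase, estimating Q-values from the collected data (e.g. with least-squares function approximation in the non-tabular case), and passing those estimates to `update()`; the adaptive learning rate and the optimistic term are what the abstract credits for the improvement from $\tilde{O}(T^{3/4})$ to $\tilde{O}(T^{2/3})$ regret.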