We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.