We propose a new method for optimistic planning in infinite-horizon discounted Markov decision processes based on the idea of adding regularization to the updates of an otherwise standard approximate value iteration procedure. This technique allows us to avoid contraction and monotonicity arguments that are typically required by existing analyses of approximate dynamic programming methods, and in particular to use approximate transition functions estimated via least-squares procedures in MDPs with linear function approximation. We use our method to provide a computationally efficient algorithm for learning near-optimal policies in discounted linear kernel MDPs from a single stream of experience, and show that it achieves near-optimal statistical guarantees.
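To make the central idea concrete, the sketch below illustrates a regularized value iteration update on a small tabular MDP, using entropy regularization (a soft, log-sum-exp maximum in place of the hard maximum) as the concrete regularizer. The function name `regularized_value_iteration`, the temperature parameter `eta`, and the toy MDP are assumptions made for illustration only; this is not the paper's optimistic planning algorithm and omits its least-squares model estimates and optimism bonuses.

```python
import numpy as np

def regularized_value_iteration(P, r, gamma=0.9, eta=10.0, n_iters=200):
    """Approximate value iteration where the hard max over actions is
    replaced by a soft (entropy-regularized) max with temperature 1/eta.

    P: transition tensor of shape (S, A, S); r: rewards of shape (S, A).
    Illustrative regularized update only, not the paper's exact procedure.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(n_iters):
        Q = r + gamma * P @ V                      # Q-values, shape (S, A)
        # Soft (log-sum-exp) maximum instead of a hard max over actions
        V = np.log(np.exp(eta * Q).sum(axis=1)) / eta
    # Softmax policy induced by the final Q-values
    pi = np.exp(eta * (Q - V[:, None]))
    pi /= pi.sum(axis=1, keepdims=True)
    return V, pi

# Tiny random MDP to exercise the routine
rng = np.random.default_rng(0)
S, A = 5, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
r = rng.uniform(size=(S, A))
V, pi = regularized_value_iteration(P, r)
print(V)
```

As the temperature `1/eta` shrinks, the soft maximum approaches the standard Bellman optimality update; the regularized update is what allows the analysis to dispense with contraction and monotonicity arguments.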