In this work we present an approach for building tight model-free confidence intervals for the optimal value function $V^\star$ in general infinite-horizon MDPs via upper solutions. We propose a novel upper value iterative procedure (UVIP) to construct upper solutions for a given agent's policy. UVIP yields a model-free method of policy evaluation. We analyze convergence properties of the approximate UVIP under rather general assumptions and illustrate its performance on a number of benchmark RL problems.