We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except $H$. These improvements come at the expense of a mild (polynomial) increase in memory and computational costs of both the algorithm and its output policy.
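To make the claimed improvement concrete, the display below contrasts the classical approximate policy iteration guarantee with the bound stated above for CAPI. This is only an illustrative sketch: the notation ($V^\star$, $V^{\pi}$, and the effective horizon $H = 1/(1-\gamma)$) is a conventional choice and is not copied from the paper's theorem statements.
\[
\limsup_{k \to \infty} \big\|V^\star - V^{\pi_k}\big\|_\infty \;\le\; \frac{2\gamma}{(1-\gamma)^2}\,\epsilon \;=\; O\!\left(H^2 \epsilon\right) \quad\text{(classical API)},
\qquad
\big\|V^\star - V^{\pi_{\mathrm{CAPI}}}\big\|_\infty \;\le\; O\!\left(H \epsilon\right) \quad\text{(CAPI)},
\]
where $\epsilon$ bounds the worst-case error of the approximate action-value functions of stationary policies used across iterations.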