On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

We consider local planning in fixed-horizon MDPs with a generative model under the assumption that the optimal value function lies close to the span of a feature map. The generative model provides local access to the MDP: the planner can ask for random transitions from previously returned states and arbitrary actions, and features are only accessible for states that are encountered in this process. As opposed to previous work (e.g., Lattimore et al. (2020)) where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound by Weisz et al. (2020) established that the related problem when the action-value function of the optimal policy is linearly realizable requires an exponential number of queries, either in $H$ (the horizon of the MDP) or $d$ (the dimension of the feature mapping). Their construction crucially relies on having an exponentially large action set. In contrast, in this work, we establish that poly$(H,d)$ planning is possible with state-value function realizability whenever the action set has a constant size. In particular, we present the TensorPlan algorithm which uses poly$((dH/\delta)^A)$ simulator queries to find a $\delta$-optimal policy relative to any deterministic policy for which the value function is linearly realizable with some bounded parameter. This is the first algorithm to give a polynomial query complexity guarantee using only linear realizability of a single competing value function. Whether the computation cost is similarly bounded remains an open question. We extend the upper bound to the near-realizable case and to the infinite-horizon discounted setup. We also present a lower bound in the infinite-horizon episodic setting: planners that achieve constant suboptimality need exponentially many queries, either in $d$ or the number of actions.
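
The local-access protocol described above can be made concrete with a small sketch. The code below is not TensorPlan and is not taken from the paper; it only illustrates, under assumed names, the interaction model in which the planner may query transitions solely from states it has already been shown, may read features solely for those states, and pays one unit of query cost per simulator call. The class `LocalAccessSimulator`, the helper `build_toy_simulator`, and the three-state toy MDP are all hypothetical.

```python
import random
from typing import Dict, Hashable, List, Tuple

State = Hashable
Outcome = Tuple[float, State, float]  # (probability, next_state, reward)


class LocalAccessSimulator:
    """Toy generative model with local access (hypothetical, for illustration)."""

    def __init__(
        self,
        transitions: Dict[Tuple[State, int], List[Outcome]],
        features: Dict[State, List[float]],
        initial_state: State,
        seed: int = 0,
    ) -> None:
        self._transitions = transitions
        self._features = features
        self._rng = random.Random(seed)
        # Only the initial state is known to the planner at the start.
        self._seen = {initial_state}
        self.num_queries = 0

    def features(self, state: State) -> List[float]:
        # Features are accessible only for states encountered so far.
        if state not in self._seen:
            raise KeyError(f"state {state!r} has not been encountered")
        return self._features[state]

    def query(self, state: State, action: int) -> Tuple[State, float]:
        # Transitions may be requested only from previously returned states.
        if state not in self._seen:
            raise KeyError(f"state {state!r} has not been encountered")
        self.num_queries += 1
        outcomes = self._transitions[(state, action)]
        weights = [p for p, _, _ in outcomes]
        _, next_state, reward = self._rng.choices(outcomes, weights=weights, k=1)[0]
        self._seen.add(next_state)
        return next_state, reward


def build_toy_simulator() -> LocalAccessSimulator:
    # A made-up MDP with 3 states, 2 actions, and 2-dimensional features.
    transitions = {
        ("s0", 0): [(1.0, "s1", 0.0)],
        ("s0", 1): [(0.5, "s1", 0.0), (0.5, "s2", 1.0)],
        ("s1", 0): [(1.0, "s1", 0.0)],
        ("s1", 1): [(1.0, "s2", 1.0)],
        ("s2", 0): [(1.0, "s2", 0.0)],
        ("s2", 1): [(1.0, "s2", 0.0)],
    }
    features = {"s0": [1.0, 0.0], "s1": [0.0, 1.0], "s2": [1.0, 1.0]}
    return LocalAccessSimulator(transitions, features, initial_state="s0")


if __name__ == "__main__":
    sim = build_toy_simulator()
    next_state, reward = sim.query("s0", 0)
    print(next_state, reward, sim.features(next_state), sim.num_queries)
```

Query complexity in the sense of the abstract corresponds to the final value of `num_queries`; the sketch says nothing about how a planner should choose which state-action pairs to query.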
