On Query-efficient Planning in MDPs under Linear Realizability of the Optimal State-value Function

We consider the problem of local planning in fixed-horizon Markov Decision Processes (MDPs) with a generative model, under the assumption that the optimal value function lies in the span of a feature map that is accessible through the generative model. As opposed to previous work, where linear realizability of all policies was assumed, we consider the significantly relaxed assumption of a single linearly realizable (deterministic) policy. A recent lower bound established that the related problem, in which the action-value function of the optimal policy is linearly realizable, requires an exponential number of queries, either in $H$ (the horizon of the MDP) or $d$ (the dimension of the feature mapping). That construction crucially relies on having an exponentially large action set. In contrast, in this work we establish that $\mathrm{poly}(H, d)$ learning is possible (with state-value function realizability) whenever the action set is small (i.e., $O(1)$). In particular, we present the TensorPlan algorithm, which uses $\mathrm{poly}((dH/\delta)^A)$ queries, with $A$ the number of actions, to find a $\delta$-optimal policy relative to any deterministic policy whose value function is linearly realizable with a parameter from a fixed-radius ball around zero. This is the first algorithm to give a polynomial query complexity guarantee using only linear realizability of a single competing value function. Whether the computation cost is similarly bounded remains an interesting open question. The upper bound is complemented by a lower bound which proves that in the infinite-horizon episodic setting, planners that achieve constant suboptimality need exponentially many queries, either in the dimension or the number of actions.
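To make the realizability assumption concrete, the following is a minimal sketch that checks whether the optimal state-value function of a toy fixed-horizon MDP lies in the span of a feature map. The toy simulator `step`, the feature map `features`, and the parameter `theta_star` are illustrative assumptions and do not come from the paper. The code only verifies realizability by brute-force backward induction. It is not an implementation of TensorPlan.

```python
from itertools import product

H = 3             # horizon (number of stages), chosen for illustration
ACTIONS = (0, 1)  # 0 = stay, 1 = move one step to the right


def step(s, a):
    """Deterministic toy simulator (assumed): returns (reward, next_state)."""
    return float(s), s + a


def features(k, s):
    """Hypothetical feature map (d = 2) for state s with k stages to go."""
    return (float(k * s), k * (k - 1) / 2.0)


def optimal_value(k, s):
    """Optimal value with k stages to go, by backward induction (recursion)."""
    if k == 0:
        return 0.0
    outcomes = (step(s, a) for a in ACTIONS)
    return max(r + optimal_value(k - 1, s_next) for r, s_next in outcomes)


def max_realizability_gap(theta, states):
    """Largest |v*_k(s) - <features(k, s), theta>| over the states and stages."""
    gap = 0.0
    for k, s in product(range(1, H + 1), states):
        pred = sum(f * t for f, t in zip(features(k, s), theta))
        gap = max(gap, abs(optimal_value(k, s) - pred))
    return gap


theta_star = (1.0, 1.0)  # assumed parameter; one vector shared by all stages
print(max_realizability_gap(theta_star, range(5)))
```

In this toy problem the optimal value with `k` stages to go is `k*s + k*(k-1)/2`, so a single two-dimensional parameter represents it at every stage and the reported gap is zero up to floating-point error. A planner in the setting of the abstract would not be handed `theta_star`. It only gets query access to the simulator and the features.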
