We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions, and that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines at runtime.
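As a minimal sketch of the kind of policy representation the abstract describes: a softmax policy whose action probabilities are determined entirely by a low-dimensional parameter vector and a feature map. The names `theta`, `phi`, and `softmax_policy` are illustrative assumptions, not identifiers from the paper.

```python
import numpy as np

def softmax_policy(theta, features):
    """Action distribution pi(. | s) proportional to exp(theta . phi(s, a)).

    features: (num_actions, d) array whose rows are the feature vectors
    phi(s, a) for a fixed state s; theta: (d,) parameter vector.
    """
    logits = features @ theta
    logits -= logits.max()          # shift for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical small example: d-dimensional features, a handful of actions.
d, num_actions = 4, 3
rng = np.random.default_rng(0)
theta = rng.normal(size=d)                 # the compact policy parameter
phi = rng.normal(size=(num_actions, d))    # feature vectors for one state
pi = softmax_policy(theta, phi)            # a valid probability distribution
```

The point of such a representation is that storing the policy requires only the `d` entries of `theta`; evaluating it at any state is a single matrix-vector product plus a softmax, with no planning subroutine at query time.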