We study query- and computationally-efficient planning algorithms with linear function approximation and a simulator. We assume that the agent has only local access to the simulator, meaning that the agent can query the simulator only at states that have been visited before. This setting is more practical than the setting considered in many prior works on reinforcement learning with a generative model. We propose an algorithm named Confident Monte Carlo Least-Squares Policy Iteration (Confident MC-LSPI) for this setting. Under the assumption that the Q-functions of all deterministic policies are linear in known features of the state-action pairs, we show that our algorithm has query and computational complexities that are polynomial in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while being independent of the size of the state space. One technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this method to leverage existing results on $\ell_\infty$-bounded approximate policy iteration to show that our algorithm can learn the optimal policy for the given initial state even with only local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.
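To make the setting concrete, below is a minimal, illustrative Python sketch of an MC-LSPI-style loop under the local-access constraint. The toy tabular MDP, the random feature map `PHI`, the `LocalAccessSimulator` class, and all constants are assumptions introduced purely for illustration, and the confidence (uncertainty) check that gives the algorithm its name is omitted for brevity; this is not the paper's actual algorithm or experimental setup.

```python
import numpy as np

# Illustrative sketch (assumed toy problem): LSPI-style policy iteration with
# Monte Carlo rollouts, where the simulator may only be queried at states it
# has already revealed (local access).

np.random.seed(0)

N_STATES, N_ACTIONS, DIM = 6, 2, 4               # toy sizes (assumed)
GAMMA, HORIZON, N_ROLLOUTS = 0.9, 20, 30         # discount, rollout length, samples per (s, a)
PHI = np.random.randn(N_STATES, N_ACTIONS, DIM)  # feature map phi(s, a) (assumed)
P = np.random.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS))  # transition kernel
R = np.random.rand(N_STATES, N_ACTIONS)          # reward function


class LocalAccessSimulator:
    """Simulator that only answers queries at previously visited states."""

    def __init__(self, init_state):
        self.visited = {init_state}

    def step(self, s, a):
        if s not in self.visited:
            raise ValueError("local access violated: state was never visited")
        s_next = int(np.random.choice(N_STATES, p=P[s, a]))
        self.visited.add(s_next)                 # a newly reached state becomes queryable
        return R[s, a], s_next


def greedy_action(theta, s):
    # Greedy policy w.r.t. the linear Q-estimate phi(s, a)^T theta.
    return int(np.argmax(PHI[s] @ theta))


def mc_return(sim, s, a, theta):
    """Discounted Monte Carlo return of the greedy policy, starting from (s, a)."""
    total, discount = 0.0, 1.0
    for _ in range(HORIZON):
        r, s = sim.step(s, a)
        total += discount * r
        discount *= GAMMA
        a = greedy_action(theta, s)
    return total


def mc_lspi(init_state, n_iters=10):
    sim = LocalAccessSimulator(init_state)
    theta = np.zeros(DIM)
    for _ in range(n_iters):
        A, b = np.eye(DIM) * 1e-3, np.zeros(DIM)   # small ridge term for stability
        # Policy evaluation: regress Monte Carlo returns onto features,
        # querying only at states the simulator has already revealed.
        for s in list(sim.visited):
            for a in range(N_ACTIONS):
                y = np.mean([mc_return(sim, s, a, theta) for _ in range(N_ROLLOUTS)])
                A += np.outer(PHI[s, a], PHI[s, a])
                b += PHI[s, a] * y
        theta = np.linalg.solve(A, b)              # least-squares Q-fit (policy improvement follows via greedy_action)
    return theta


theta = mc_lspi(init_state=0)
print("greedy action at the initial state:", greedy_action(theta, 0))
```

The one constraint the sketch does enforce is the defining feature of the setting: rollouts may only start from, and query transitions at, states the simulator has already revealed, in contrast to full generative-model access where any state-action pair can be queried directly.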