We study query- and computationally efficient planning algorithms with linear function approximation and a simulator. We assume that the agent has only local access to the simulator, meaning that the agent can query the simulator only at states that have been visited before. This setting is more practical than the settings considered in many prior works on reinforcement learning with a generative model. We propose two algorithms for this setting, named confident Monte Carlo least-squares policy iteration (Confident MC-LSPI) and confident Monte Carlo Politex (Confident MC-Politex). Under the assumption that the Q-functions of all policies are linear in known features of the state-action pairs, we show that our algorithms have polynomial query and computational costs in the dimension of the features, the effective planning horizon, and the targeted sub-optimality, while these costs are independent of the size of the state space. One technical contribution of our work is the introduction of a novel proof technique that makes use of a virtual policy iteration algorithm. We use this technique to leverage existing results on $\ell_\infty$-bounded approximate policy iteration and show that our algorithms can learn the optimal policy for the given initial state even with only local access to the simulator. We believe that this technique can be extended to broader settings beyond this work.
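To make the realizability assumption concrete, it can be written as follows; the notation here (a known feature map $\phi$, per-policy weight vectors $w_\pi$, and feature dimension $d$) is ours for illustration and may differ from the paper's:
\[
  Q^{\pi}(s, a) \;=\; \phi(s, a)^{\top} w_{\pi}
  \quad \text{for every policy } \pi \text{ and all state-action pairs } (s, a),
  \qquad \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^{d}.
\]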
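The abstract gives no pseudocode, so the following is a minimal, heavily simplified Python sketch of a Monte Carlo least-squares policy iteration loop under local access, in the spirit of Confident MC-LSPI but not the paper's actual algorithm. All names (LocalAccessSimulator, mc_return, mc_lspi_sketch), the toy tabular MDP, and the one-hot features are our own illustrative assumptions; in particular, the core-set growth and the "confident" check of the real algorithm are only hinted at in comments.

import numpy as np

class LocalAccessSimulator:
    """Toy tabular MDP exposed only through local access: step(s, a) may be
    called only on states the simulator has already returned."""

    def __init__(self, n_states=10, n_actions=2, gamma=0.9, seed=0):
        rng = np.random.default_rng(seed)
        self.n_states, self.n_actions, self.gamma = n_states, n_actions, gamma
        self.P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        self.R = rng.random((n_states, n_actions))
        self.visited = {0}   # local access: only these states can be queried
        self.rng = rng

    def reset(self):
        return 0

    def step(self, s, a):
        assert s in self.visited, "local access violated: state never visited"
        s_next = int(self.rng.choice(self.n_states, p=self.P[s, a]))
        self.visited.add(s_next)
        return self.R[s, a], s_next

def phi(s, a, n_states, n_actions):
    # One-hot (tabular) features; the paper assumes a general d-dimensional map.
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def mc_return(sim, s, a, w, horizon=50):
    # Truncated Monte Carlo return of the policy greedy w.r.t. weights w,
    # starting with action a in state s.
    ret, disc = 0.0, 1.0
    for _ in range(horizon):
        r, s = sim.step(s, a)
        ret += disc * r
        disc *= sim.gamma
        a = int(np.argmax([phi(s, b, sim.n_states, sim.n_actions) @ w
                           for b in range(sim.n_actions)]))
    return ret

def mc_lspi_sketch(sim, n_iters=10, n_rollouts=20, reg=1e-3):
    d = sim.n_states * sim.n_actions
    w = np.zeros(d)
    sim.reset()
    for _ in range(n_iters):
        # Simplified "core set" of queryable state-action pairs; the real
        # algorithm grows this set only when a new feature direction is not
        # yet covered (the "confident" check), which we omit here.
        core = [(s, a) for s in sorted(sim.visited) for a in range(sim.n_actions)]
        # Policy evaluation: regularized least squares on Monte Carlo targets.
        Phi = np.array([phi(s, a, sim.n_states, sim.n_actions) for s, a in core])
        y = np.array([np.mean([mc_return(sim, s, a, w) for _ in range(n_rollouts)])
                      for s, a in core])
        w = np.linalg.solve(Phi.T @ Phi + reg * np.eye(d), Phi.T @ y)
        # Policy improvement is implicit: the next iteration's rollouts
        # act greedily with respect to the new weights w.
    return w

In this sketch, rollouts only ever query states the simulator has already produced, which is the local-access constraint described above; the least-squares fit plays the role of policy evaluation and the greedy action selection inside mc_return plays the role of policy improvement.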