We present a midpoint policy iteration algorithm for solving linear quadratic optimal control problems in both model-based and model-free settings. The algorithm is a variation of Newton's method, and we show that in the model-based setting it achieves cubic convergence, which is superior to standard policy iteration and policy gradient algorithms, which achieve quadratic and linear convergence, respectively. We also demonstrate that the algorithm can be approximately implemented without knowledge of the dynamics model by using least-squares estimates of the state-action value function from trajectory data, from which policy improvements can be obtained. With sufficient trajectory data, the policy iterates converge cubically to approximately optimal policies, and do so under the same sample budget as approximate standard policy iteration. Numerical experiments demonstrate the effectiveness of the proposed algorithms.
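To illustrate the midpoint-Newton idea underlying the algorithm, here is a minimal sketch on a scalar continuous-time algebraic Riccati equation (standard policy iteration corresponds to Newton's method on this equation). This is a toy illustration under assumed parameter values, not the paper's algorithm: the names `midpoint_newton`, `f`, `fprime`, and the system parameters `a`, `b`, `q`, `r` are hypothetical choices for this example.

```python
import numpy as np

# Hypothetical scalar LQR data: dx/dt = a*x + b*u, cost integrand q*x^2 + r*u^2.
A, B, Q, R = -1.0, 1.0, 1.0, 1.0

def f(p):
    # Scalar algebraic Riccati residual: f(p) = 2*a*p - (b^2/r)*p^2 + q.
    return 2.0 * A * p - (B**2 / R) * p**2 + Q

def fprime(p):
    # Derivative of the residual with respect to p.
    return 2.0 * A - 2.0 * (B**2 / R) * p

def midpoint_newton(p0, tol=1e-12, max_iter=20):
    # Midpoint Newton iteration: take a half Newton step, then use the
    # slope at the midpoint for the full update; this root-finding
    # variant converges cubically rather than quadratically.
    p = p0
    for _ in range(max_iter):
        fp = f(p)
        if abs(fp) < tol:
            break
        p_mid = p - fp / (2.0 * fprime(p))  # half Newton step
        p = p - fp / fprime(p_mid)          # full step with midpoint slope
    return p

# For these parameters the Riccati equation p^2 + 2p - 1 = 0 has the
# stabilizing solution p = sqrt(2) - 1.
p_star = midpoint_newton(1.0)
```

Each iteration costs one extra derivative evaluation versus plain Newton; in the policy iteration setting this corresponds to an additional policy evaluation per step, which is how the cubic rate is obtained within the same overall structure.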