This paper studies the robustness of reinforcement learning algorithms to errors in the learning process. Specifically, we revisit the benchmark problem of discrete-time linear quadratic regulation (LQR) and study the long-standing open question: Under what conditions is the policy iteration method robustly stable from a dynamical systems perspective? Using advanced stability results in control theory, it is shown that policy iteration for LQR is inherently robust to small errors in the learning process and enjoys small-disturbance input-to-state stability: whenever the error in each iteration is bounded and small, the solutions of the policy iteration algorithm are also bounded and, moreover, enter and stay in a small neighbourhood of the optimal LQR solution. As an application, a novel off-policy optimistic least-squares policy iteration is proposed for the LQR problem when the system dynamics are subject to additive stochastic disturbances. The proposed results in robust reinforcement learning are validated by a numerical example.
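To make the setting concrete, the following is a minimal sketch (not the authors' implementation) of exact policy iteration for discrete-time LQR, with an optional small bounded perturbation injected at each policy-evaluation step to imitate errors in the learning process. The system, cost matrices, initial gain, and error model below are illustrative assumptions, not taken from the paper.

```python
# Sketch of policy iteration for discrete-time LQR with injected evaluation errors.
# Assumed setup: x_{k+1} = A x_k + B u_k, cost sum_k (x_k' Q x_k + u_k' R u_k),
# policy u_k = -K_i x_k. All matrices are illustrative placeholders.
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

rng = np.random.default_rng(0)

# Illustrative system and cost matrices (assumptions, not from the paper).
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.eye(1)

# Optimal cost matrix from the discrete algebraic Riccati equation, for reference.
P_star = solve_discrete_are(A, B, Q, R)

def policy_iteration(K0, num_iters=30, error_size=0.0):
    """Run policy iteration from a stabilizing gain K0.

    error_size > 0 adds a bounded random perturbation to each evaluated
    cost matrix, imitating inexact policy evaluation.
    """
    K = K0
    for _ in range(num_iters):
        # Policy evaluation: P_i solves P = (A - B K)' P (A - B K) + Q + K' R K.
        A_cl = A - B @ K
        P = solve_discrete_lyapunov(A_cl.T, Q + K.T @ R @ K)
        # Inject a small bounded evaluation error (robustness experiment).
        if error_size > 0:
            E = rng.uniform(-1, 1, size=P.shape)
            P = P + error_size * (E + E.T) / 2
        # Policy improvement: K_{i+1} = (R + B' P B)^{-1} B' P A.
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return P, K

# Exact iterations converge to P*; small errors keep the iterates in a small
# neighbourhood of P*, consistent with small-disturbance input-to-state stability.
K0 = np.array([[1.0, 2.0]])  # assumed stabilizing initial gain
for eps in (0.0, 1e-3, 1e-2):
    P, _ = policy_iteration(K0, error_size=eps)
    print(f"error size {eps:>6}: ||P - P*|| = {np.linalg.norm(P - P_star):.2e}")
```

Running the loop with increasing error sizes illustrates the abstract's claim qualitatively: the distance to the optimal solution stays bounded and shrinks with the error bound.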