On stabilizing reinforcement learning without Lyapunov functions

Reinforcement learning remains one of the major directions in the contemporary development of control engineering and machine learning. An appealing intuition, flexible settings, and ease of application are among the many perks of this methodology. From the standpoint of machine learning, the main strength of a reinforcement learning agent is its ability to "capture" (learn) the optimal behavior in a given environment. Typically, the agent is built on neural networks, and it is their approximation abilities that give rise to this belief. From the standpoint of control engineering, however, reinforcement learning has serious deficiencies. The most significant one is the lack of a stability guarantee for the agent-environment closed loop. A great deal of research has been, and is being, devoted to stabilizing reinforcement learning. When it comes to stability, the celebrated Lyapunov theory is the de facto tool. It is thus no wonder that so many techniques for stabilizing reinforcement learning rely on Lyapunov theory in one way or another. In control theory, there is an intricate connection between a stabilizing controller and a Lyapunov function. Employing such a pair thus seems quite attractive for designing stabilizing reinforcement learning. However, computing a Lyapunov function is generally a cumbersome process. In this note, we show how to construct a stabilizing reinforcement learning agent that does not employ such a function at all. We only assume that a Lyapunov function exists, which is natural if the given system (read: environment) is stabilizable, but we do not need to compute one.
