A framework for online, stabilizing reinforcement learning

Online reinforcement learning is concerned with training an agent on the fly through dynamic interaction with the environment. In such settings, the specifics of the application generally rule out long pre-training, as is commonly done in offline, model-free approaches, which are akin to dynamic programming. Such applications are found more often in industry than in purely digital fields, such as cloud services, video games, and database management, where reinforcement learning has demonstrated success. Online reinforcement learning, in contrast, is closer to classical control, which uses some model knowledge about the environment. Stability of the closed loop (the agent plus the environment) is a major challenge for such online approaches. In this paper, we tackle this problem through a fusion of online reinforcement learning with elements of classical control, namely those based on the Lyapunov theory of stability. The idea is to start the agent at once, without pre-training, and learn an approximately optimal policy under specially designed constraints that guarantee stability. The resulting approach was tested in an extensive experimental study with a mobile robot, using a nominal parking controller as the baseline. It was observed that the suggested agent always parked the robot successfully while significantly improving the cost. Although many approaches can be used for mobile robot control, we suggest that the experiments show the promising potential of online reinforcement learning agents based on Lyapunov-like constraints. The presented methodology may be used in safety-critical industrial applications where stability is necessary.
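
The core idea of learning under a stability constraint can be illustrated with a minimal sketch. This is not the paper's algorithm: the single-integrator environment, the quadratic Lyapunov candidate, the proportional nominal controller, the random candidate actions, and all names and constants (`DT`, `DECAY`, `constrained_action`, `simulate`) are assumptions chosen for illustration. Each proposed action is accepted only if it makes the Lyapunov function decay fast enough; otherwise the nominal action is applied.

```python
import numpy as np

DT = 0.05     # sampling time (assumed value)
DECAY = 0.5   # required Lyapunov decay rate (assumed value)


def step(x, u):
    """Toy single-integrator environment, standing in for the robot."""
    return x + DT * u


def lyapunov(x):
    """Quadratic Lyapunov candidate V(x) = |x|^2 (an assumption)."""
    return float(x @ x)


def nominal_controller(x):
    """Stabilizing baseline; it satisfies the decay constraint below."""
    return -x


def score(x, u):
    """Stage cost plus a crude cost-to-go guess (placeholder for a learned critic)."""
    x_next = step(x, u)
    return DT * float(x @ x + 0.1 * (u @ u)) + lyapunov(x_next)


def is_stabilizing(x, u):
    """Lyapunov-like constraint: V must shrink by at least DECAY * DT * V."""
    return lyapunov(step(x, u)) - lyapunov(x) <= -DECAY * DT * lyapunov(x)


def constrained_action(x, candidates):
    """Cheapest candidate that passes the constraint; nominal action otherwise."""
    best_u = nominal_controller(x)
    best_score = score(x, best_u)
    for u in candidates:
        if is_stabilizing(x, u) and score(x, u) < best_score:
            best_u, best_score = u, score(x, u)
    return best_u


def simulate(x0, n_steps=200, seed=0):
    """Run the loop from x0 with no pre-training; return V along the trajectory."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    values = [lyapunov(x)]
    for _ in range(n_steps):
        # Random proposals around the nominal action stand in for the agent's policy.
        candidates = nominal_controller(x) + rng.normal(scale=2.0, size=(16, x.size))
        x = step(x, constrained_action(x, candidates))
        values.append(lyapunov(x))
    return values
```

Calling `simulate([2.0, -1.0])` returns a sequence of Lyapunov values that never increases, because every applied action either passed the decay check or came from the nominal controller, which passes it by construction in this toy setup. The agent can still lower the cost relative to the baseline whenever a feasible candidate scores better.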
