Bridging model-based safety and model-free reinforcement learning (RL) for dynamic robots is appealing: model-based methods can provide formal safety guarantees, while RL-based methods can exploit the robot's agility by learning from the full-order system dynamics. However, current approaches to this problem are mostly restricted to simple systems. In this paper, we propose a new method to combine model-based safety with model-free reinforcement learning by explicitly finding a low-dimensional model of the system controlled by an RL policy and applying stability and safety guarantees on that simple model. As an example, we use the complex bipedal robot Cassie, a high-dimensional nonlinear system with hybrid dynamics and underactuation, together with its RL-based walking controller. We show that a low-dimensional dynamical model is sufficient to capture the dynamics of the closed-loop system, and we demonstrate that this model is linear, asymptotically stable, and decoupled across control inputs in all dimensions. We further show that such linearity persists even across different RL control policies. These results point to an interesting direction for understanding the relationship between RL and optimal control: whether RL tends to linearize the nonlinear system during training in some cases. Finally, we illustrate that the identified linear model can provide guarantees through a safety-critical optimal control framework, e.g., Model Predictive Control with Control Barrier Functions (MPC-CBF), on an example of autonomous navigation with Cassie while taking advantage of the agility provided by the RL-based controller.
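The pipeline described above can be sketched on a toy one-dimensional analogue. This is a hedged illustration, not the paper's implementation: the dynamics coefficients, data, safe set, and CBF decay rate below are all illustrative stand-ins. It shows (1) least-squares identification of a low-dimensional linear model from closed-loop rollout data, (2) a discrete-time stability check, and (3) a discrete-time control barrier function constraint enforced on the identified model.

```python
import numpy as np

# Hedged sketch of the abstract's pipeline on a toy 1-D system (all numbers
# illustrative, not from the paper):
#   (1) fit x_{k+1} = a*x_k + b*u_k to closed-loop rollout data,
#   (2) check asymptotic stability (|a| < 1 in discrete time),
#   (3) enforce safety via a discrete-time CBF: h(x_{k+1}) >= (1-gamma)*h(x_k).

rng = np.random.default_rng(0)

# --- (1) System identification from a simulated closed-loop rollout ---
a_true, b_true = 0.9, 0.1            # stand-in for the RL closed-loop dynamics
K = 200
u = rng.uniform(-1.0, 1.0, size=K)   # e.g., commanded walking velocities
x = np.zeros(K + 1)
for k in range(K):
    x[k + 1] = a_true * x[k] + b_true * u[k]

Z = np.column_stack([x[:-1], u])     # regressors [x_k, u_k]
(a_hat, b_hat), *_ = np.linalg.lstsq(Z, x[1:], rcond=None)

# --- (2) Stability of the identified model ---
assert abs(a_hat) < 1.0              # asymptotically stable in discrete time

# --- (3) Safety via a discrete-time CBF on the identified model ---
x_max, gamma = 1.0, 0.5              # safe set {x <= x_max}; CBF decay rate
h = lambda s: x_max - s              # barrier: h(s) >= 0 iff s is safe

def cbf_filter(s, u_nom):
    """Minimally modify u_nom so that h(x_{k+1}) >= (1-gamma)*h(x_k)."""
    # For b_hat > 0 this is an upper bound on u, solved in closed form.
    u_bound = (x_max - a_hat * s - (1.0 - gamma) * h(s)) / b_hat
    return min(u_nom, u_bound)

# Push aggressively toward the boundary; the filtered input keeps h >= 0.
s = 0.0
for _ in range(100):
    s = a_hat * s + b_hat * cbf_filter(s, u_nom=5.0)
    assert h(s) >= -1e-9
```

On noiseless data the least-squares fit recovers the coefficients exactly; on the real robot the paper's identified model would instead be fit from rollouts of the RL-controlled closed-loop system, and the scalar filter above would be replaced by the MPC-CBF optimization over the full low-dimensional model.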