Recent advances in the reinforcement learning (RL) literature have enabled roboticists to automatically train complex policies in simulated environments. However, due to the poor sample complexity of these methods, solving reinforcement learning problems using real-world data remains challenging. This paper introduces a novel cost-shaping method which aims to reduce the number of samples needed to learn a stabilizing controller. The method adds a term involving a control Lyapunov function (CLF) -- an `energy-like' function from the model-based control literature -- to typical cost formulations. Theoretical results demonstrate that the new costs lead to stabilizing controllers when smaller discount factors are used, which is well-known to reduce sample complexity. Moreover, the addition of the CLF term `robustifies' the search for a stabilizing controller by ensuring that even highly sub-optimal policies will stabilize the system. We demonstrate our approach with two hardware examples where we learn stabilizing controllers for a cartpole and an A1 quadruped with only seconds and a few minutes of fine-tuning data, respectively.
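As a rough illustration of the idea (the precise formulation and the choice of weighting are given in the body of the paper, not here), one plausible way to shape a standard per-step cost $c(x_k, u_k)$ with a CLF $V$ is
\[
\tilde{c}(x_k, u_k) \;=\; c(x_k, u_k) \;+\; \lambda\, V(x_{k+1}),
\qquad
J_\gamma(\pi) \;=\; \mathbb{E}\Big[\textstyle\sum_{k=0}^{\infty} \gamma^{k}\, \tilde{c}(x_k, u_k)\Big],
\]
where $\lambda > 0$ is an assumed weighting parameter and $\gamma \in (0,1)$ is the discount factor. Intuitively, since $V$ decreases along trajectories of a well-stabilized system, penalizing it pushes even short-horizon (small-$\gamma$) optimization toward stabilizing behavior.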