Recent advances in the reinforcement learning (RL) literature have enabled roboticists to automatically train complex policies in simulated environments. However, due to the poor sample complexity of these methods, solving RL problems using real-world data remains challenging. This paper introduces a novel cost-shaping method which aims to reduce the number of samples needed to learn a stabilizing controller. The method adds a term involving a Control Lyapunov Function (CLF) -- an `energy-like' function from the model-based control literature -- to typical cost formulations. Theoretical results demonstrate that the new costs lead to stabilizing controllers when smaller discount factors are used, which is well-known to reduce sample complexity. Moreover, the addition of the CLF term `robustifies' the search for a stabilizing controller by ensuring that even highly sub-optimal policies will stabilize the system. We demonstrate our approach with two hardware examples where we learn stabilizing controllers for a cartpole and an A1 quadruped with only seconds and a few minutes of fine-tuning data, respectively. Furthermore, simulation benchmark studies show that obtaining stabilizing policies by optimizing our proposed costs requires orders of magnitude less data compared to standard cost designs.
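As a rough sketch of the idea (the precise formulation and theoretical conditions are given in the body of the paper; the symbols below -- the CLF $V$, shaping weight $\lambda$, and quadratic weights $Q$, $R$ -- are illustrative assumptions, not the paper's exact definitions), a shaped discrete-time cost might take the form
\[
\tilde{c}(x_k, u_k) \;=\; \underbrace{x_k^\top Q\, x_k + u_k^\top R\, u_k}_{\text{standard quadratic cost}} \;+\; \underbrace{\lambda\, V(x_{k+1})}_{\text{CLF shaping term}},
\]
so that minimizing the discounted sum of $\tilde{c}$ encourages policies that drive the CLF -- and hence the state -- toward zero, even when a small discount factor is used.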