Learning optimal control policies directly on physical systems is challenging because even a single failure can lead to costly hardware damage. Most existing learning methods that guarantee safety during exploration, i.e., that incur no failures, are limited to local optima. A notable exception is the GoSafe algorithm, which, unfortunately, cannot handle high-dimensional systems and hence cannot be applied to most real-world dynamical systems. This work proposes GoSafeOpt as the first algorithm that can safely discover globally optimal policies for complex systems while giving safety and optimality guarantees. Our experiments on a robot arm, a setting that would be prohibitive for GoSafe, demonstrate that GoSafeOpt safely finds remarkably better policies than competing safe learning methods for high-dimensional domains.