The framework of simulation-to-real learning, i.e., learning policies in simulation and transferring those policies to the real world, is one of the most promising approaches towards data-efficient learning in robotics. However, due to the inevitable reality gap between simulation and the real world, a policy learned in simulation may not always produce safe behaviour on the real robot. As a result, while adapting the policy in the real world, the robot may damage itself or harm its surroundings. In this work, we introduce a novel learning algorithm called SafeAPT that leverages a diverse repertoire of policies evolved in simulation and transfers the most promising safe policy to the real robot through episodic interactions. To achieve this, SafeAPT iteratively learns a probabilistic reward model as well as a safety model from real-world observations, using simulated experiences as priors. It then performs Bayesian optimization on the repertoire with the reward model while enforcing the specified safety constraint with the safety model. SafeAPT allows a robot to adapt safely to a wide range of goals with the same repertoire of policies evolved in simulation. We compare SafeAPT with several baselines in both simulated and real robotic experiments and show that SafeAPT finds high-performance policies within a few minutes in the real world while minimizing safety violations during the interactions.
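To make the described loop concrete, the sketch below illustrates one plausible reading of the approach: Gaussian-process reward and safety models use simulated repertoire values as prior means, and a safety-constrained Bayesian-optimization step selects which policy to try on the robot each episode. This is not the authors' implementation; the toy repertoire, the `execute_on_robot` stub, the threshold, and the UCB/LCB acquisition are all illustrative assumptions.

```python
# Illustrative sketch of SafeAPT-style safe policy selection (assumptions only).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Toy repertoire: each policy has a behaviour descriptor plus simulated
# estimates of its reward and safety value (higher safety = safer).
N = 200
descriptors = rng.uniform(0.0, 1.0, size=(N, 2))
sim_reward = np.sin(3 * descriptors[:, 0]) + descriptors[:, 1]
sim_safety = 1.0 - descriptors[:, 0]
SAFETY_THRESHOLD = 0.2          # constraint: predicted safety must stay above this


def execute_on_robot(idx):
    """Hypothetical real-robot episode: biased, noisy versions of the sim values."""
    real_reward = sim_reward[idx] - 0.3 * descriptors[idx, 0]
    real_safety = sim_safety[idx] - 0.1
    return real_reward + 0.01 * rng.normal(), real_safety + 0.01 * rng.normal()


# GPs model the residual between real observations and simulated values,
# so the simulation acts as the prior mean of both models.
reward_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3)
safety_gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-3)

X_obs, r_res, s_res = [], [], []
beta = 2.0  # confidence multiplier for the UCB/LCB acquisition

for episode in range(10):
    if X_obs:
        reward_gp.fit(np.array(X_obs), np.array(r_res))
        safety_gp.fit(np.array(X_obs), np.array(s_res))
        r_mu, r_std = reward_gp.predict(descriptors, return_std=True)
        s_mu, s_std = safety_gp.predict(descriptors, return_std=True)
    else:
        r_mu = r_std = s_mu = s_std = np.zeros(N)

    reward_ucb = sim_reward + r_mu + beta * r_std   # optimistic reward estimate
    safety_lcb = sim_safety + s_mu - beta * s_std   # pessimistic safety estimate

    safe = safety_lcb >= SAFETY_THRESHOLD
    if not safe.any():
        safe = safety_lcb == safety_lcb.max()       # fall back to the safest policy
    candidates = np.where(safe)[0]
    pick = candidates[np.argmax(reward_ucb[candidates])]

    # One episodic interaction, then update the residual datasets.
    r_obs, s_obs = execute_on_robot(pick)
    X_obs.append(descriptors[pick])
    r_res.append(r_obs - sim_reward[pick])
    s_res.append(s_obs - sim_safety[pick])
    print(f"episode {episode}: policy {pick}, reward {r_obs:.3f}, safety {s_obs:.3f}")
```

The design choice sketched here is that candidate policies are first filtered by a pessimistic (lower-confidence) safety prediction and only then ranked by an optimistic reward prediction, which is one standard way to trade off exploration against safety-constraint violations during real-world adaptation.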