We present Bayesian Controller Fusion (BCF): a hybrid control strategy that combines the strengths of traditional hand-crafted controllers and model-free deep reinforcement learning (RL). BCF is well suited to the robotics domain, where reliable but suboptimal control priors exist for many tasks, while RL from scratch remains unsafe and data-inefficient. By fusing uncertainty-aware distributional outputs from each system, BCF arbitrates control between them, exploiting their respective strengths. We study BCF on two real-world robotics tasks: navigation in a vast, long-horizon environment, and a complex reaching task that involves manipulability maximisation. In both domains, simple hand-crafted controllers exist that can solve the task in a risk-averse manner, but they do not necessarily yield the optimal solution, owing to limitations in analytical modelling, controller miscalibration and task variation. Because exploration is naturally guided by the prior in the early stages of training, BCF accelerates learning, and it improves substantially beyond the performance of the control prior as the policy gains experience. More importantly, because the control prior is risk-averse, BCF ensures safe exploration and deployment: the control prior naturally dominates the action distribution in states unknown to the policy. We additionally show BCF's applicability to the zero-shot sim-to-real setting and its ability to deal with out-of-distribution states in the real world. BCF is a promising approach for combining the complementary strengths of deep RL and traditional robotic control, surpassing what either can achieve independently. The code and supplementary video material are publicly available at https://krishanrana.github.io/bcf.
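To make the idea of fusing uncertainty-aware distributional outputs concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the learned policy and the control prior each output an independent Gaussian over actions, and that fusion is the normalised product of those two Gaussians. The function name `fuse_gaussians` and all of its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np


def fuse_gaussians(mu_policy, sigma_policy, mu_prior, sigma_prior):
    """Normalised product of two independent Gaussians (per action dimension).

    Assumption: both controllers expose a mean and a standard deviation.
    Returns the mean and standard deviation of the fused Gaussian.
    """
    mu_policy = np.asarray(mu_policy, dtype=float)
    mu_prior = np.asarray(mu_prior, dtype=float)
    # Precision (inverse variance) acts as the weight of each controller.
    prec_policy = 1.0 / np.asarray(sigma_policy, dtype=float) ** 2
    prec_prior = 1.0 / np.asarray(sigma_prior, dtype=float) ** 2

    fused_var = 1.0 / (prec_policy + prec_prior)
    fused_mu = fused_var * (prec_policy * mu_policy + prec_prior * mu_prior)
    return fused_mu, np.sqrt(fused_var)


# Hypothetical case: the policy is very uncertain (sigma = 10.0) while the
# prior is confident (sigma = 0.1), so the fused action stays near the prior.
mu, sigma = fuse_gaussians(mu_policy=1.0, sigma_policy=10.0,
                           mu_prior=0.0, sigma_prior=0.1)
```

Under these assumptions, the controller with the lower variance dominates the fused mean. This matches the behaviour described above, where the control prior dominates the action distribution in states unknown to the policy.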