Model-free reinforcement learning (RL) is a powerful tool for learning a broad range of robot skills and policies. However, a lack of policy interpretability can inhibit the successful deployment of these policies in downstream applications, particularly when differences in environmental conditions may result in unpredictable behaviour or generalisation failures. As a result, there has been a growing emphasis in machine learning on including stronger inductive biases in models to improve generalisation. This paper proposes an alternative strategy, inverse value estimation for interpretable policy certificates (IV-Posterior), which seeks to identify the inductive biases or idealised conditions of operation already held by pre-trained policies, and then to use this information to guide their deployment. IV-Posterior uses Masked Autoregressive Flows to fit a distribution over the set of conditions or environmental parameters in which a policy is likely to be effective. This distribution can then be used as a policy certificate in downstream applications. We illustrate the use of IV-Posterior across two environments, and show that substantial performance gains can be obtained when policy selection incorporates knowledge of the inductive biases that these policies hold.
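The sketch below illustrates the core idea described above, under stated assumptions rather than as the authors' implementation: a Masked Autoregressive Flow (here from the `nflows` library) is fit over the environment parameters in which a pre-trained policy was effective, and the resulting density is used as a certificate to select among policies at deployment time. The helper names (`fit_certificate`, `select_policy`) and hyperparameters are hypothetical.

```python
# Minimal sketch of an IV-Posterior-style policy certificate, assuming the
# `nflows` library; not the paper's actual code.
import torch
from nflows.flows import MaskedAutoregressiveFlow


def fit_certificate(env_params, num_steps=2000, lr=1e-3):
    """Fit a density over environment parameters where a policy succeeded.

    env_params: (N, D) tensor of environmental parameters (e.g. friction,
    mass) collected from rollouts in which the policy performed well.
    """
    flow = MaskedAutoregressiveFlow(
        features=env_params.shape[1], hidden_features=64, num_layers=4
    )
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(num_steps):
        opt.zero_grad()
        # Maximum-likelihood fit: maximise log-density of successful conditions.
        loss = -flow.log_prob(env_params).mean()
        loss.backward()
        opt.step()
    return flow


def select_policy(certificates, policies, observed_params):
    """Pick the policy whose certificate assigns the highest density to the
    currently observed environment parameters."""
    with torch.no_grad():
        scores = [c.log_prob(observed_params.unsqueeze(0)) for c in certificates]
    return policies[int(torch.argmax(torch.stack(scores)))]
```

In this reading, each pre-trained policy carries its own certificate, and deployment reduces to scoring the observed environment parameters under each certificate and choosing the best-matched policy.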