Traditional imitation learning provides a set of methods and algorithms for learning a reward function or policy from expert demonstrations. Learning from demonstration has been shown to be advantageous for navigation tasks because it allows machine learning non-experts to quickly provide the information needed to learn complex traversal behaviors. However, a minimal set of demonstrations is unlikely to capture all of the information needed to achieve the desired behavior in every possible future operational environment. Due to distributional shift between environments, a robot may encounter features that were rarely or never observed during training and for which the appropriate reward value is uncertain, leading to undesired outcomes. This paper proposes a Bayesian technique that quantifies uncertainty over the weights of a linear reward function given a dataset of minimal human demonstrations, so that a robot can operate safely in dynamic environments. This uncertainty is incorporated into a risk-averse set of weights used to generate cost maps for planning. Experiments with a simulated robot in a 3-D environment show that our proposed algorithm enables the robot to avoid dangerous terrain entirely in two of three test scenarios and to accumulate less risk than related approaches in all scenarios, without requiring any additional demonstrations.
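To make the idea concrete, below is a minimal sketch of one plausible way to turn posterior uncertainty over linear reward weights into a risk-averse cost map, assuming the Bayesian posterior is represented by weight samples and that larger weights correspond to higher traversal cost for non-negative terrain features. The function names (`risk_averse_weights`, `cost_map`), the parameter `kappa`, and the mean-plus-standard-deviation rule are illustrative assumptions, not the paper's actual algorithm.

```python
import numpy as np

def risk_averse_weights(weight_samples, kappa=1.0):
    """Collapse posterior samples of linear reward weights into a single
    risk-averse weight vector: posterior mean plus kappa times posterior
    standard deviation (illustrative rule, not the paper's method).

    weight_samples: (S, D) array of samples from the posterior over weights.
    kappa:          risk-aversion coefficient (hypothetical parameter).
    """
    return weight_samples.mean(axis=0) + kappa * weight_samples.std(axis=0)

def cost_map(feature_map, weights):
    """Linear cost map: per-cell cost is the dot product of the terrain
    feature vector (H, W, D) with the weight vector (D,)."""
    return feature_map @ weights

# Toy example: 200 posterior samples over 4 terrain features, 50x50 grid.
rng = np.random.default_rng(0)
samples = rng.normal(loc=[1.0, 0.2, 5.0, 0.5],
                     scale=[0.1, 0.05, 2.0, 0.3],
                     size=(200, 4))
features = rng.random((50, 50, 4))
costs = cost_map(features, risk_averse_weights(samples, kappa=2.0))
```

Under this sketch, features whose weights are uncertain (large posterior spread) receive inflated cost, so a planner consuming `costs` (e.g., A* over the grid) is biased away from rarely or never observed terrain, which is the behavior the abstract describes.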