In traditional robot exploration methods, the robot usually has no prior bias about the environment it is exploring. It therefore assigns equal importance to all candidate goals, which leads to inefficient exploration. Alternatively, a hand-tuned policy is often used to adjust the value of goals. In this paper, we present a method to learn how "good" certain states are, as measured by a state value function, to provide hints for the robot's exploration decisions. We propose to learn state value functions from previously collected offline datasets and then to transfer and improve the value function at test time in a new environment. Moreover, such environments usually provide very few or even no extrinsic rewards or feedback to the robot. Therefore, in this work we also tackle the problem of sparse extrinsic rewards by designing several intrinsic rewards that encourage the robot to gather more information during exploration. These reward functions then become the building blocks of the state value functions. We evaluate our method in challenging subterranean and urban environments. To the best of our knowledge, this work is the first to demonstrate value function prediction from previously collected datasets to aid exploration in challenging subterranean environments.
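As an illustrative sketch (not necessarily the exact formulation used in this work), intrinsic rewards can serve as building blocks of a state value function through the standard discounted-return definition, where $r_{\text{int}}$ denotes a designed intrinsic reward (e.g., an information-gain term) and $\gamma \in [0,1)$ is a discount factor:
\[
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r_{\text{int}}(s_t, a_t) \;\middle|\; s_0 = s\right].
\]
Under this view, a value function learned from offline data scores candidate goal states by their expected future information gain, which can then guide exploration decisions in a new environment.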