In many real-world applications of control systems and robotics, linear temporal logic (LTL) is a widely used task specification language whose compositional grammar naturally induces temporally extended behaviours across tasks, including conditionals and alternative realizations. An important problem in RL with LTL tasks is to learn task-conditioned policies that can zero-shot generalize to new LTL instructions not observed during training. However, because symbolic observations are often lossy and LTL tasks can have long time horizons, previous works can suffer from issues such as sample inefficiency during training and infeasible or sub-optimal solutions. To tackle these issues, this paper proposes a novel multi-task RL algorithm with improved learning efficiency and optimality. To achieve globally optimal task completion, we propose to learn options conditioned on future subgoals via a novel off-policy approach. To propagate the rewards of satisfying future subgoals back more efficiently, we propose to train a multi-step value function conditioned on the subgoal sequence, updated with Monte Carlo estimates of multi-step discounted returns. In experiments on three different domains, we evaluate the LTL generalization capability of agents trained by the proposed method, showing its advantage over previous representative methods.
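To make the reward-propagation idea concrete, the following is a minimal sketch (not the authors' implementation) of how Monte Carlo multi-step discounted returns can be computed along a finished trajectory and used as regression targets for a value function conditioned on the remaining subgoal sequence. All names (Step, monte_carlo_returns, the discount factor, and the toy trajectory) are illustrative assumptions.

```python
# Minimal sketch, assuming a value function V(s, g_1..g_k) conditioned on a
# sequence of future subgoals; names and reward scheme are illustrative only.
from dataclasses import dataclass
from typing import List

GAMMA = 0.99  # assumed discount factor


@dataclass
class Step:
    state: list          # environment observation at this step
    reward: float        # reward, e.g. 1.0 when a subgoal is satisfied
    subgoals: List[int]  # remaining subgoal sequence at this step


def monte_carlo_returns(trajectory: List[Step], gamma: float = GAMMA) -> List[float]:
    """Compute multi-step discounted returns G_t = sum_k gamma^k * r_{t+k}
    for every step of a completed trajectory (plain Monte Carlo estimate)."""
    returns, g = [], 0.0
    for step in reversed(trajectory):
        g = step.reward + gamma * g
        returns.append(g)
    return list(reversed(returns))


# Usage: each (state, remaining-subgoal-sequence) pair paired with its Monte
# Carlo return forms a regression target for the subgoal-conditioned value
# function, propagating credit for satisfying future subgoals in one update.
traj = [Step(state=[0.0], reward=0.0, subgoals=[2, 5]),
        Step(state=[0.1], reward=1.0, subgoals=[5]),   # first subgoal reached
        Step(state=[0.3], reward=1.0, subgoals=[])]    # task completed
print(monte_carlo_returns(traj))  # [1.9701, 1.99, 1.0]
```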