Linear temporal logic (LTL) is a widely used task specification language whose compositional grammar naturally induces temporally extended behaviours across tasks, including conditionals and alternative realizations. An important problem in RL with LTL tasks is to learn task-conditioned policies that can zero-shot generalize to new LTL instructions not observed during training. However, because symbolic observations are often lossy and LTL tasks can have long time horizons, previous works can suffer from issues such as training sample inefficiency and infeasibility or sub-optimality of the learned solutions. To tackle these issues, this paper proposes a novel multi-task RL algorithm with improved learning efficiency and optimality. To achieve global optimality of task completion, we propose to learn options dependent on future subgoals via a novel off-policy approach. To propagate the rewards of satisfying future subgoals back more efficiently, we propose to train a multi-step value function conditioned on the subgoal sequence, which is updated with Monte Carlo estimates of multi-step discounted returns. In experiments on three different domains, we evaluate the LTL generalization capability of the agent trained by the proposed method, showing its advantage over previous representative methods.
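As a rough illustration of the second ingredient, the sketch below shows one way a value function conditioned on a subgoal sequence could be regressed onto Monte Carlo estimates of multi-step discounted returns. All names and design choices here (SubgoalValueNet, the GRU subgoal encoder, gamma, the rollout format) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (assumed design, not the authors' code): a value function
# conditioned on a subgoal sequence, trained on Monte Carlo discounted returns.
import torch
import torch.nn as nn

class SubgoalValueNet(nn.Module):
    """V(s, g_1..g_k): value of state s given the remaining subgoal sequence."""
    def __init__(self, state_dim, subgoal_dim, hidden=128):
        super().__init__()
        # Encode the (variable-length) subgoal sequence into a fixed-size vector.
        self.encoder = nn.GRU(subgoal_dim, hidden, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(state_dim + hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, state, subgoal_seq):
        _, h = self.encoder(subgoal_seq)               # h: (1, B, hidden)
        return self.head(torch.cat([state, h[-1]], dim=-1)).squeeze(-1)

def monte_carlo_returns(rewards, gamma=0.99):
    """Discounted return G_t = sum_{i>=t} gamma^(i-t) r_i for each step of one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    return list(reversed(returns))

def update_value(net, optimizer, states, subgoal_seqs, rewards, gamma=0.99):
    """Regress V(s_t, remaining subgoals) onto the Monte Carlo return from step t."""
    targets = torch.tensor(monte_carlo_returns(rewards, gamma), dtype=torch.float32)
    preds = net(states, subgoal_seqs)                  # one prediction per visited state
    loss = nn.functional.mse_loss(preds, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The point of conditioning on the whole remaining subgoal sequence, rather than only the next subgoal, is that the Monte Carlo targets then credit a state for how well it sets up *all* future subgoals, which is what allows reward for distant subgoals to propagate back in a single update.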