We address the problem of teaching a deep reinforcement learning (RL) agent to follow instructions in multi-task environments. The combinatorial task sets we target consist of up to $\sim 10^{39}$ unique tasks. We employ a well-known formal language -- linear temporal logic (LTL) -- to specify instructions, using a domain-specific vocabulary. We propose a novel approach to learning that exploits the compositional syntax and semantics of LTL, enabling our RL agent to learn task-conditioned policies that generalize to new instructions not observed during training. The expressive power of LTL supports the specification of a diverse set of complex, temporally extended behaviours, including conditionals and alternative realizations. To reduce the overhead of learning LTL semantics, we introduce an environment-agnostic LTL pretraining scheme that improves sample efficiency in downstream environments. Experiments on discrete and continuous domains demonstrate the strength of our approach in learning to solve (unseen) tasks given LTL instructions.
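As a purely illustrative sketch (the propositions and formula below are hypothetical and not drawn from this paper), an LTL instruction over domain propositions such as $\mathit{coffee}$, $\mathit{office}$, and $\mathit{obstacle}$ could be written as
$$\Diamond\bigl(\mathit{coffee} \wedge \Diamond\,\mathit{office}\bigr) \;\wedge\; \Box\,\lnot\mathit{obstacle},$$
read as "eventually reach the coffee and, after that, eventually reach the office, while never hitting an obstacle"; here $\Diamond$ and $\Box$ are the standard LTL eventually and always operators.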