Compositional reinforcement learning is a promising approach for training policies to perform complex long-horizon tasks. Typically, a high-level task is decomposed into a sequence of subtasks, and a separate policy is trained to perform each subtask. In this paper, we focus on the problem of training subtask policies so that they can be used to perform any task, where a task is given by a sequence of subtasks. We aim to maximize the worst-case performance over all tasks rather than the average-case performance. We formulate the problem as a two-agent zero-sum game in which the adversary picks the sequence of subtasks. We propose two RL algorithms to solve this game: one is an adaptation of existing multi-agent RL algorithms to our setting, and the other is an asynchronous version that enables parallel training of subtask policies. We evaluate our approach on two multi-task environments with continuous states and actions, and demonstrate that our algorithms outperform state-of-the-art baselines.
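To make the worst-case objective concrete, the following is a minimal toy sketch (not the paper's algorithm) of the adversarial formulation: the adversary selects the subtask sequence (task) on which the current subtask policies perform worst, and the protagonist improves the policies on that task. All names here (SUBTASKS, evaluate_task, improve_policies) are hypothetical stubs standing in for real environments and policy updates.

```python
import itertools
import random

# Hypothetical subtask labels; tasks are sequences of subtasks.
SUBTASKS = ["reach", "grasp", "place"]
MAX_TASK_LEN = 2

# Toy per-subtask "policy quality"; a real system would hold neural policies.
policy_quality = {s: random.uniform(0.2, 0.8) for s in SUBTASKS}

def evaluate_task(task):
    """Stub return of executing the subtask policies in sequence:
    the task succeeds only if every subtask succeeds."""
    value = 1.0
    for s in task:
        value *= policy_quality[s]
    return value

def improve_policies(task, step=0.05):
    """Stub policy-improvement step on the selected worst-case task."""
    for s in task:
        policy_quality[s] = min(1.0, policy_quality[s] + step)

# Enumerate all tasks up to the maximum length.
all_tasks = [t for n in range(1, MAX_TASK_LEN + 1)
             for t in itertools.product(SUBTASKS, repeat=n)]

for _ in range(20):
    # Adversary: pick the task with the lowest current return (worst case).
    worst_task = min(all_tasks, key=evaluate_task)
    # Protagonist: improve the subtask policies on that task.
    improve_policies(worst_task)

print("worst-case return:", evaluate_task(min(all_tasks, key=evaluate_task)))
```

In the paper's setting both players are trained with RL; this sketch only illustrates why optimizing against the adversarially chosen task sequence drives up the minimum return over all tasks.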