In this paper, we study the problem of learning to satisfy temporal logic specifications with a group of agents in an unknown environment that may exhibit probabilistic behaviour. From a learning perspective, these specifications provide a rich formal language with which to capture tasks or objectives, while from a logic and automated verification perspective, the introduction of learning capabilities allows for practical applications in large, stochastic, unknown environments. The existing work in this area is, however, limited. Of the frameworks that consider full linear temporal logic or have correctness guarantees, all methods thus far consider only the case of a single temporal logic specification and a single agent. To overcome this limitation, we develop the first multi-agent reinforcement learning technique for temporal logic specifications, which is also novel in its ability to handle multiple specifications. We provide correctness and convergence guarantees for our main algorithm, ALMANAC (Automaton/Logic Multi-Agent Natural Actor-Critic), even when using function approximation. Alongside our theoretical results, we further demonstrate the applicability of our technique via a set of preliminary experiments.