Verifiable and Compositional Reinforcement Learning Systems

From arXiv. Accepted for publication at ICAPS 2022. Changes since v1: an additional example with continuous states and actions has been added. Additional discussion has been added to the section presenting the algorithm and to the section presenting the numerical experiments. Wording and formatting have been edited for consistency with the version published by AAAI Press.

We propose a framework for verifiable and compositional reinforcement learning (RL) in which a collection of RL subsystems, each of which learns to accomplish a separate subtask, are composed to achieve an overall task. The framework consists of a high-level model, represented as a parametric Markov decision process (pMDP), which is used to plan and to analyze compositions of subsystems, and of the collection of low-level subsystems themselves. By defining interfaces between the subsystems, the framework enables automatic decompositions of task specifications, e.g., reach a target set of states with a probability of at least 0.95, into individual subtask specifications, i.e., achieve the subsystem's exit conditions with at least some minimum probability, given that its entry conditions are met. This in turn allows for the independent training and testing of the subsystems; if they each learn a policy satisfying the appropriate subtask specification, then their composition is guaranteed to satisfy the overall task specification. Conversely, if the subtask specifications cannot all be satisfied by the learned policies, we present a method, formulated as the problem of finding an optimal set of parameters in the pMDP, to automatically update the subtask specifications to account for the observed shortcomings. The result is an iterative procedure for defining subtask specifications and for training the subsystems to meet them. As an additional benefit, this procedure allows particularly challenging or important components of an overall task to be identified automatically and focused on during training. Experimental results demonstrate the presented framework's novel capabilities.
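
The composition guarantee described above can be illustrated with a minimal sketch. It assumes a deliberately simplified high-level model, not the authors' pMDP formulation or code: each subsystem either reaches its exit condition with probability `p[name]` or fails outright, and the planner may pick any subsystem whose entry condition matches the current node. All names (`best_success_probability`, `meets_task_spec`, the nodes `s0`, `s1`, `goal`) and all numeric values other than the 0.95 task threshold are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical high-level model: subsystem name -> (entry node, exit node).
# Assumption: a subsystem reaches its exit with probability p[name],
# and otherwise ends in an absorbing failure state.
Subsystems = Dict[str, Tuple[str, str]]


def best_success_probability(subsystems: Subsystems, p: Dict[str, float],
                             start: str, goal: str) -> float:
    """Maximum probability of reaching `goal` from `start` by chaining subsystems."""
    nodes = {start, goal}
    for entry, exit_node in subsystems.values():
        nodes.update((entry, exit_node))
    value = {node: 0.0 for node in nodes}
    value[goal] = 1.0
    # Value iteration; len(nodes) sweeps suffice because cycles never help
    # when every success probability is at most 1.
    for _ in range(len(nodes)):
        for name, (entry, exit_node) in subsystems.items():
            if entry != goal:
                value[entry] = max(value[entry], p[name] * value[exit_node])
    return value[start]


def meets_task_spec(subsystems: Subsystems, p: Dict[str, float],
                    start: str, goal: str, threshold: float) -> bool:
    """Check whether the composition satisfies the overall task specification."""
    return best_success_probability(subsystems, p, start, goal) >= threshold


# Two routes to the goal: a then b, or c directly.
subsystems = {"a": ("s0", "s1"), "b": ("s1", "goal"), "c": ("s0", "goal")}

# Subtask specifications (minimum success probabilities) chosen by the planner.
subtask_specs = {"a": 0.98, "b": 0.98, "c": 0.90}
print(meets_task_spec(subsystems, subtask_specs, "s0", "goal", 0.95))

# Suppose training shows that subsystem b only achieves 0.95 in practice.
observed = {"a": 0.98, "b": 0.95, "c": 0.90}
print(meets_task_spec(subsystems, observed, "s0", "goal", 0.95))
```

In this toy setting, the second check fails because subsystem `b` falls short of its subtask specification. That is the situation in which the abstract's iterative procedure would revise the subtask specifications, for example by demanding more of `a` or `c`, before training resumes.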
