Reinforcement learning (RL) often requires decomposing a problem into subtasks and composing the behaviors learned on those subtasks. Compositionality in RL has the potential to create modular subtask units that interface with other system capabilities. However, building compositional models requires characterizing the minimal assumptions under which composition remains robust. We develop a framework for a \emph{compositional theory} of RL from a categorical point of view. Given this categorical representation of compositionality, we investigate sufficient conditions under which learning-by-parts results in the same optimal policy as learning on the whole. In particular, our approach introduces a category $\mathsf{MDP}$ whose objects are Markov decision processes (MDPs) serving as models of tasks. We show that $\mathsf{MDP}$ admits natural compositional operations, such as certain fiber products and pushouts. These operations make compositional phenomena in RL explicit and unify existing constructions, such as puncturing hazardous states in composite MDPs and incorporating state-action symmetries. We also model sequential task completion by introducing the language of zig-zag diagrams, which arises as an immediate application of the pushout operation in $\mathsf{MDP}$.
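As a schematic illustration only (not the paper's formal construction), a pushout in $\mathsf{MDP}$ can be pictured as gluing two task MDPs along a shared part; the names $M_0$, $M_1$, $M_2$ below are chosen purely for exposition:
\[
\begin{array}{ccc}
M_0 & \longrightarrow & M_1 \\
\big\downarrow & & \big\downarrow \\
M_2 & \longrightarrow & M_1 \sqcup_{M_0} M_2
\end{array}
\]
Chaining such squares is one informal way to read the zig-zag diagrams mentioned above for sequential task completion; the precise definitions of the objects, morphisms, and operations are given in the body of the paper.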