We introduce the partially observable history process (POHP) formalism for reinforcement learning. POHP centers around the actions and observations of a single agent and abstracts away the presence of other players without reducing them to stochastic processes. Our formalism provides a streamlined interface for designing algorithms that defy categorization as exclusively single or multi-agent, and for developing theory that applies across these domains. We show how the POHP formalism unifies traditional models including the Markov decision process, the Markov game, the extensive-form game, and their partially observable extensions, without introducing burdensome technical machinery or violating the philosophical underpinnings of reinforcement learning. We illustrate the utility of our formalism by concisely exploring observable sequential rationality, re-deriving the extensive-form regret minimization (EFR) algorithm, and examining EFR's theoretical properties in greater generality.