Maximising a cumulative reward function that is Markov and stationary, i.e., defined over state-action pairs and independent of time, is sufficient to capture many kinds of goals in a Markov decision process (MDP). However, not all goals can be captured in this manner. In this paper we study convex MDPs in which goals are expressed as convex functions of the stationary distribution and show that they cannot, in general, be formulated using stationary reward functions. Convex MDPs generalize the standard reinforcement learning (RL) problem formulation to a larger framework that includes many supervised and unsupervised RL problems, such as apprenticeship learning, constrained MDPs, and so-called `pure exploration'. Our approach is to reformulate the convex MDP problem as a min-max game involving policy and cost (negative reward) `players', using Fenchel duality. We propose a meta-algorithm for solving this problem and show that it unifies many existing algorithms in the literature.
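As a rough sketch of the reformulation described above (the notation here is assumed for illustration, not taken from the abstract): let $d_\pi$ denote the stationary state-action distribution induced by a policy $\pi$, let $f$ be the convex objective to be minimised, and let $f^*$ be its convex conjugate. Applying Fenchel duality to $f$ gives

\[
\min_{\pi} \, f(d_\pi) \;=\; \min_{\pi} \, \max_{\lambda} \; \big( \langle \lambda, d_\pi \rangle - f^*(\lambda) \big),
\]

where the outer minimisation is the policy player and the inner maximisation over $\lambda$ is the cost player: $\lambda$ acts as a per-state-action cost (i.e., a negative reward), recovering a standard reward-maximising RL problem for the policy player whenever the cost player's choice is held fixed.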