Multi-objective optimization models that encode ordered, sequential constraints provide a way to model a range of challenging problems, including encoding preferences, modeling a curriculum, and enforcing safety measures. A recently developed theory of topological Markov decision processes (TMDPs) captures this range of problems for the case of discrete states and actions. In this work, we extend TMDPs to continuous spaces and unknown transition dynamics by formulating, proving, and implementing the policy gradient theorem for TMDPs. This theoretical result enables the creation of TMDP learning algorithms that use function approximators and that generalize existing deep reinforcement learning (DRL) approaches. Specifically, we present a new policy gradient algorithm for TMDPs obtained as a simple extension of the proximal policy optimization (PPO) algorithm. We demonstrate it on a real-world multi-objective navigation problem with an arbitrary ordering of objectives, both in simulation and on a real robot.
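To make the PPO-style extension concrete, the sketch below shows one hypothetical way per-objective advantages could be combined under an ordered weighting before entering a standard clipped surrogate objective. This is a minimal illustration under assumed names and an assumed priority-weighting scheme; it is not the paper's TMDP policy gradient algorithm.

```python
# Hypothetical sketch (not the paper's algorithm): combining per-objective
# advantages with ordered priority weights, then applying the standard PPO
# clipped surrogate. All names and the weighting scheme are assumptions.
import numpy as np

def ordered_advantage(advs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Combine per-objective advantages (shape [T, K]) with ordered weights
    (shape [K]), where earlier objectives receive larger weights."""
    return advs @ weights

def ppo_clip_objective(ratio: np.ndarray, adv: np.ndarray, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate, averaged over the batch."""
    unclipped = ratio * adv
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return float(np.mean(np.minimum(unclipped, clipped)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, K = 64, 3                        # timesteps, number of ordered objectives
    advs = rng.normal(size=(T, K))      # per-objective advantage estimates
    ratio = np.exp(rng.normal(scale=0.1, size=T))   # pi_new / pi_old
    weights = np.array([1.0, 0.1, 0.01])            # assumed priorities, highest first
    adv = ordered_advantage(advs, weights)
    print("surrogate objective:", ppo_clip_objective(ratio, adv))
```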