In many problems, it is desirable to optimize an objective function while imposing constraints on some other objectives. A Constrained Partially Observable Markov Decision Process (C-POMDP) allows modeling of such problems under transition uncertainty and partial observability. Typically, the constraints in C-POMDPs enforce a threshold on expected cumulative costs starting from an initial state distribution. In this work, we first show that optimal C-POMDP policies may violate Bellman's principle of optimality and thus may exhibit unintuitive behaviors, which can be undesirable for some (e.g., safety critical) applications. Additionally, online re-planning with C-POMDPs is often ineffective due to the inconsistency resulting from the violation of Bellman's principle of optimality. To address these drawbacks, we introduce a new formulation: the Recursively-Constrained POMDP (RC-POMDP), that imposes additional history-dependent cost constraints on the C-POMDP. We show that, unlike C-POMDPs, RC-POMDPs always have deterministic optimal policies, and that optimal policies obey Bellman's principle of optimality. We also present a point-based dynamic programming algorithm that synthesizes admissible near-optimal policies for RC-POMDPs. Evaluations on a set of benchmark problems demonstrate the efficacy of our algorithm and show that policies for RC-POMDPs produce more desirable behaviors than policies for C-POMDPs.
翻译:暂无翻译