Dynamic optimization of mean and variance in Markov decision processes (MDPs) is a long-standing challenge, because dynamic programming fails for the variance metric. In this paper, we propose a new approach to find the globally optimal policy for combined metrics of the steady-state mean and variance in an infinite-horizon undiscounted MDP. By introducing the concepts of pseudo mean and pseudo variance, we convert the original problem into a bilevel MDP problem, in which the inner problem is a standard MDP optimizing the pseudo mean-variance metric and the outer problem is a single-parameter selection problem optimizing the pseudo mean. We use sensitivity analysis of MDPs to derive properties of this bilevel problem. By solving the inner standard MDPs for pseudo mean-variance optimization, we can identify worse policy spaces that are dominated by the optimal policies of the pseudo problems. We propose an optimization algorithm that finds the globally optimal policy by iteratively removing such worse policy spaces. The convergence and complexity of the algorithm are studied. Another policy dominance property is also established to further improve the efficiency of the algorithm. Numerical experiments demonstrate the performance and efficiency of our algorithm. To the best of our knowledge, our algorithm is the first that can efficiently find the globally optimal policy for mean-variance optimization in MDPs. These results also hold for the problem of solely minimizing the variance metric in MDPs.
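To make the pseudo-variance construction concrete, the following is a minimal sketch of one natural formulation; the notation ($\eta^{\pi}$, $\sigma^{2}_{\pi}$, $\lambda$, $\beta$) is illustrative and the precise definitions are those given in the body of the paper. For a policy $\pi$, the steady-state mean and variance of the reward are
\[
\eta^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1} r(X_t, A_t)\right],
\qquad
\sigma^{2}_{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1} \big(r(X_t, A_t) - \eta^{\pi}\big)^{2}\right],
\]
and the pseudo variance is obtained by replacing the policy-dependent mean $\eta^{\pi}$ with a fixed reference value $\lambda$:
\[
\tilde{\sigma}^{2}_{\pi}(\lambda) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{T-1} \big(r(X_t, A_t) - \lambda\big)^{2}\right].
\]
Because the per-stage cost $\big(r(X_t, A_t) - \lambda\big)^{2}$ no longer depends on the policy's own mean, an inner problem of the form $\max_{\pi}\, \eta^{\pi} - \beta\, \tilde{\sigma}^{2}_{\pi}(\lambda)$, for a fixed risk weight $\beta > 0$ and fixed $\lambda$, is a standard average-reward MDP amenable to dynamic programming, and the outer problem reduces to selecting the scalar $\lambda$.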