We consider the problem of learning an optimal policy for infinite-horizon Markov decision processes (MDPs). To this end, we propose a variant of Stochastic Mirror Descent for convex programming problems with Lipschitz-continuous functionals. An important feature is that the method can use inexact values of the functional constraints. We analyze the algorithm in the general case and obtain a convergence-rate estimate in which errors do not accumulate over the iterations of the method. Using this algorithm, we obtain the first parallel algorithm for average-reward MDPs with a generative model. One of the main features of the proposed method is its low communication cost in a distributed centralized setting.
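To make the idea of mirror descent with inexactly evaluated functional constraints concrete, the following is a minimal sketch of a generic switching scheme, not the algorithm analyzed in this work. It assumes a Euclidean mirror map (so each mirror step reduces to a projected subgradient step), a toy two-dimensional problem, and hypothetical names and parameters throughout (`EPS`, `DELTA`, `switching_mirror_descent`, the step size, and the noise model are all illustrative assumptions). When the inexact constraint value is at most `EPS`, the method steps along a noisy subgradient of the objective and records the point; otherwise it steps along a subgradient of the constraint.

```python
import random

EPS = 0.05    # tolerated constraint violation (assumed value)
DELTA = 0.01  # bound on the error of constraint evaluations (assumed value)

def f(x):
    """Toy Lipschitz convex objective: sum of |x_i - 2|."""
    return sum(abs(v - 2.0) for v in x)

def g(x):
    """Toy functional constraint g(x) <= 0 with g(x) = sum(x) - 1."""
    return sum(x) - 1.0

def noisy_f_subgrad(x, rng):
    """Stochastic subgradient of f: a true subgradient plus Gaussian noise."""
    return [(1.0 if v > 2.0 else -1.0) + rng.gauss(0.0, 0.2) for v in x]

def g_subgrad(x):
    """Subgradient of the linear constraint."""
    return [1.0 for _ in x]

def inexact_g(x, rng):
    """Constraint value known only up to an error of magnitude at most DELTA."""
    return g(x) + rng.uniform(-DELTA, DELTA)

def project_box(x, lo=0.0, hi=3.0):
    """Euclidean projection onto the box [lo, hi]^n."""
    return [min(max(v, lo), hi) for v in x]

def switching_mirror_descent(x0, step, iters, rng):
    x = list(x0)
    total = [0.0] * len(x)
    n_productive = 0
    for _ in range(iters):
        if inexact_g(x, rng) <= EPS:
            # Productive step: record the point and move along the objective.
            total = [t + v for t, v in zip(total, x)]
            n_productive += 1
            direction = noisy_f_subgrad(x, rng)
        else:
            # Non-productive step: reduce the constraint violation.
            direction = g_subgrad(x)
        x = project_box([v - step * d for v, d in zip(x, direction)])
    if n_productive == 0:
        raise RuntimeError("no productive steps were taken")
    # The output is the average of the points visited on productive steps.
    return [t / n_productive for t in total], n_productive

rng = random.Random(0)
x_avg, n_productive = switching_mirror_descent([0.0, 0.0], step=0.01, iters=20000, rng=rng)
print(x_avg, f(x_avg), g(x_avg))
```

In this toy problem the constrained optimum has objective value 3. Because every averaged point satisfies the inexact test, the true constraint value at the output is at most `EPS + DELTA`, which illustrates how a bounded evaluation error enters the guarantee once rather than growing with the number of iterations.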