This paper studies stochastic control problems in which the action space is a space of measures and the cost is regularized by relative entropy. We identify a suitable metric space on which we construct a gradient flow for the measure-valued control process along which the cost functional is guaranteed to decrease. It is shown that any invariant measure of this gradient flow satisfies the Pontryagin optimality principle. When the problem is sufficiently convex, the gradient flow converges exponentially fast. Furthermore, the optimal measure-valued control admits a Bayesian interpretation, which means that one can incorporate prior knowledge when solving the stochastic control problem. This work is motivated by the desire to extend the theoretical underpinning for the convergence of stochastic gradient-type algorithms widely used in the reinforcement learning community to solve control problems.
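To fix ideas, here is a minimal sketch of an entropy-regularized objective of the kind described above; the notation ($J^{\tau}$, $f$, $g$, $\tau$, the action space $A$, and the reference measure $\mu$) is illustrative, not taken from the paper itself:
\[
J^{\tau}(\nu) \;=\; \mathbb{E}\Big[\int_0^T \Big( \int_A f(t, X_t, a)\,\nu_t(\mathrm{d}a) \;+\; \tau\,\mathrm{KL}\big(\nu_t \,\|\, \mu\big) \Big)\,\mathrm{d}t \;+\; g(X_T)\Big],
\]
where the state $X$ evolves under the measure-valued (relaxed) control $\nu = (\nu_t)_{t \in [0,T]}$, $\mu$ is a fixed reference measure on $A$, $\tau > 0$ tunes the strength of the regularization, and $\mathrm{KL}$ denotes relative entropy. Heuristically, for a fixed cost $H(a)$ the pointwise minimizer of $\int_A H\,\mathrm{d}\nu + \tau\,\mathrm{KL}(\nu \,\|\, \mu)$ has the Gibbs form $\nu^{*}(\mathrm{d}a) \propto e^{-H(a)/\tau}\,\mu(\mathrm{d}a)$, which is the sense in which the reference measure acts as a Bayesian prior.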