On Optimal Control and Expectation-Maximisation: Theory and an Outlook Towards Algorithms

In this work we demonstrate how both the Stochastic and the Risk Sensitive Optimal Control problem can be treated by means of the Expectation-Maximisation algorithm. We show how this treatment yields two separate iterative programs, each of which generates a unique but closely related sequence of density functions. We propose to interpret these density functions as beliefs, that is, as probabilistic proxies for the deterministic optimal policy. More formally, two fixed point iteration schemes are derived whose stationary points coincide with the deterministic optimal policies, by virtue of the proven convergence of Expectation-Maximisation methods. Our results are intimately related to the paradigm of Control as Inference, which here refers to a collection of approaches whose aim is likewise to recast optimal control as an instance of probabilistic inference. Although this paradigm has already resulted in the development of several powerful Reinforcement Learning algorithms, the fundamental problem statement is usually introduced by teleological arguments. We argue that the present results demonstrate that previously established Control as Inference frameworks in fact isolate a single step from either of the proposed iterative programs. In any case, the present treatment provides them with a deontological argument of validity. By exposing the underlying technical mechanism we aim to contribute to the general acceptance of Control as Inference as a framework superseding the present Optimal Control paradigm. To motivate the general relevance of the presented treatment, we further discuss parallels with Path Integral Control and other areas of research before sketching the outlines of future algorithmic development.
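To give some intuition for a sequence of beliefs that concentrates on a deterministic optimum, the following is a minimal toy sketch and not the scheme derived in this work. It assumes a single-step problem with a finite set of candidate controls and known costs, and it assumes a multiplicative update of the form new_belief(u) ∝ belief(u) · exp(-cost(u) / temperature). The function names, the cost values and the `temperature` parameter are hypothetical and chosen for illustration only.

```python
import math


def reweight_belief(belief, costs, temperature=1.0):
    """One assumed update step: new_belief(u) is proportional to belief(u) * exp(-cost(u) / temperature)."""
    min_cost = min(costs)  # shifting by the minimum keeps the exponentials well scaled
    weights = [
        b * math.exp(-(c - min_cost) / temperature)
        for b, c in zip(belief, costs)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def iterate_belief(costs, n_iters=50, temperature=1.0):
    """Repeat the update from a uniform prior and return every belief visited."""
    belief = [1.0 / len(costs)] * len(costs)
    history = [belief]
    for _ in range(n_iters):
        belief = reweight_belief(belief, costs, temperature)
        history.append(belief)
    return history


# Hypothetical costs of four candidate controls; index 1 is the cheapest.
costs = [2.0, 0.5, 1.0, 3.0]
history = iterate_belief(costs)
final_belief = history[-1]
best_index = max(range(len(costs)), key=lambda i: final_belief[i])
```

In this toy, the first update from the uniform prior is a Boltzmann (softmax) distribution over the costs, and repeating the update moves the probability mass towards the cheapest control, so the belief approaches a point mass on the deterministic minimiser.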
