Sample-based trajectory optimisers are a promising tool for the control of robotic systems with non-differentiable dynamics and cost functions. Contemporary approaches derive from a restricted subclass of stochastic optimal control where the optimal policy can be expressed in terms of an expectation over stochastic paths. By estimating this expectation with Monte Carlo sampling and reinterpreting the process noise as exploration noise, one obtains a stochastic search algorithm tailored to (deterministic) trajectory optimisation. For future algorithmic development, it is essential to properly understand the theoretical foundations that allow such methods to be derived in a principled fashion. In this paper we make a connection between entropy regularisation in optimisation and deterministic optimal control. We then show that the optimal policy is given by a belief function rather than a deterministic function. The policy belief is governed by a Bayesian-type update in which the likelihood can be expressed as a conditional expectation over paths induced by a prior policy. Our theoretical investigation firmly roots sample-based trajectory optimisation in the larger family of control as inference. It allows us to justify a number of heuristics that are common in the literature and to motivate a number of new improvements that benefit convergence.
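To make the sampling scheme described above concrete, the following is a minimal sketch of one update of an MPPI-style optimiser, the best-known member of this family: Gaussian exploration noise is added to a nominal control sequence, and rollouts are combined with an exponential (softmax) weighting over path costs, which plays the role of the likelihood in the Bayesian-type belief update. The function names, signature, and parameter values here are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def mppi_update(u_nominal, rollout_cost, n_samples=64, sigma=0.5, lam=1.0, rng=None):
    """One MPPI-style update of a nominal control sequence (illustrative sketch).

    u_nominal    : (T, m) nominal control trajectory (the prior policy mean)
    rollout_cost : callable mapping a (T, m) control sequence to a scalar path cost
    sigma        : standard deviation of the Gaussian exploration noise
    lam          : temperature of the entropy regularisation (softmax weighting)
    """
    rng = np.random.default_rng() if rng is None else rng
    T, m = u_nominal.shape

    # Sample perturbed control sequences: the "exploration noise" view of the
    # Monte Carlo estimator over stochastic paths.
    eps = rng.normal(scale=sigma, size=(n_samples, T, m))
    costs = np.array([rollout_cost(u_nominal + e) for e in eps])

    # Exponentially weighted average of the perturbations: a softmax over path
    # costs, acting as the likelihood in the Bayesian-type policy-belief update.
    w = np.exp(-(costs - costs.min()) / lam)
    w /= w.sum()

    return u_nominal + np.einsum('n,ntm->tm', w, eps)
```

Iterating this update under an annealed temperature `lam` recovers the kind of convergence heuristics the paper sets out to justify.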