The problem of continuous inverse optimal control (over a finite time horizon) is to learn the unknown cost function over a sequence of continuous control variables from expert demonstrations. In this article, we study this fundamental problem in the framework of energy-based models, where the observed expert trajectories are assumed to be random samples from a probability density function defined as the exponential of the negative cost function, up to a normalizing constant. The parameters of the cost function are learned by maximum likelihood via an "analysis by synthesis" scheme, which iterates two steps: (1) a synthesis step, which samples synthesized trajectories from the current probability density using Langevin dynamics, with gradients computed by back-propagation through time, and (2) an analysis step, which updates the model parameters based on the statistical difference between the synthesized trajectories and the observed trajectories. Because an efficient optimization algorithm is usually available for an optimal control problem, we also consider a convenient approximation of the above learning method, in which the sampling in the synthesis step is replaced by optimization. Moreover, to make the sampling or optimization more efficient, we propose to train the energy-based model simultaneously with a top-down trajectory generator via cooperative learning, where the trajectory generator is used to quickly initialize the synthesis step of the energy-based model. We demonstrate the proposed methods on autonomous driving tasks and show that they can learn suitable cost functions for optimal control.
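The analysis-by-synthesis loop described above can be illustrated with a minimal NumPy sketch. Everything specific in it is an assumption made for illustration and is not taken from the article: the toy dynamics x_{t+1} = x_t + u_t, the two hand-picked cost features (control effort and goal-tracking error), the linear cost C_theta(u) = theta · phi(u), the hand-written backward pass that stands in for back-propagation through time, and all function names and hyperparameters. For a cost that is linear in its features, the maximum-likelihood gradient with respect to theta is the mean feature vector of the synthesized trajectories minus that of the observed trajectories, which is the "statistical difference" used in the analysis step here.

```python
import numpy as np

def rollout(u, x0=0.0):
    """States x_1..x_T under the assumed toy dynamics x_{t+1} = x_t + u_t."""
    return x0 + np.cumsum(u, axis=-1)

def features(u, goal=1.0):
    """Assumed cost features: control effort and goal-tracking error."""
    x = rollout(u)
    effort = np.sum(u ** 2, axis=-1)
    tracking = np.sum((x - goal) ** 2, axis=-1)
    return np.stack([effort, tracking], axis=-1)

def cost(theta, u, goal=1.0):
    """Linear-in-features cost C_theta(u) = theta . phi(u)."""
    return features(u, goal) @ theta

def cost_grad_u(theta, u, goal=1.0):
    """Gradient of the cost w.r.t. the controls (hand-written backward pass)."""
    x = rollout(u)
    # x_t depends on every u_k with k <= t, so sum the state errors from the end.
    back = np.cumsum((2.0 * (x - goal))[..., ::-1], axis=-1)[..., ::-1]
    return theta[0] * 2.0 * u + theta[1] * back

def langevin(theta, u, n_steps=30, step=0.05, rng=None, noise=True):
    """Synthesis step: Langevin updates on the control sequence.

    With noise=False this reduces to plain gradient descent on the cost,
    i.e. the optimization-based approximation of the sampling step.
    """
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(n_steps):
        u = u - 0.5 * step ** 2 * cost_grad_u(theta, u)
        if noise:
            u = u + step * rng.standard_normal(u.shape)
    return u

def learn(observed, theta0, n_iters=200, lr=0.01, rng=None):
    """Analysis by synthesis: alternate sampling and parameter updates."""
    rng = np.random.default_rng(0) if rng is None else rng
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        # Random initialization; a learned generator could supply this instead.
        init = rng.standard_normal(observed.shape)
        synth = langevin(theta, init, rng=rng)
        # Log-likelihood gradient: E_model[phi] - E_data[phi].
        grad = features(synth).mean(axis=0) - features(observed).mean(axis=0)
        # Keep weights positive so the density stays normalizable (assumption).
        theta = np.maximum(theta + lr * grad, 1e-3)
    return theta
```

A call such as `learn(observed, theta0=[1.0, 1.0])`, with `observed` an array of shape `(num_trajectories, T)`, returns the estimated feature weights. In this sketch the random initialization inside `learn` is the place where a trajectory generator trained by cooperative learning would plug in, and passing `noise=False` to `langevin` gives the optimization variant of the synthesis step.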