Imitation Learning via Differentiable Physics

Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process that alternates between learning a reward function and learning a policy, and they tend to suffer from long training times and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, Imitation Learning via Differentiable Physics (ILD), which removes the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. ILD incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, minimizes the distance between the expert trajectory and the agent trajectory, and back-propagates the gradient into the policy through the temporal physics operators. With the physics prior, ILD policies not only transfer to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves stability and training speed. To simplify the complex optimization landscape induced by temporal physics operations, ILD dynamically selects the learning objectives for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods on a variety of continuous control tasks in Brax while requiring only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and generalizes to unseen configurations.
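
The single-loop idea described above (unroll the dynamics under the policy, measure the distance to the expert trajectory, back-propagate through the simulator) can be illustrated with a minimal JAX sketch. This is not the authors' implementation. Everything in it is an assumption made for illustration: the toy 1-D point-mass `step` function stands in for a differentiable simulator such as Brax, the policy is deterministic and linear rather than sampled, the constants `DT` and `HORIZON` and the hypothetical `EXPERT_PARAMS` are arbitrary, and `matched_loss` is only one plausible reading of "dynamically selects the learning objectives for each state" (each state is paired with its nearest counterpart in the other trajectory).

```python
import jax
import jax.numpy as jnp

DT = 0.1       # integration step (assumed)
HORIZON = 20   # rollout length (assumed)

# Hypothetical expert: a PD-style controller that drives the mass toward position 1.
EXPERT_PARAMS = {"w": jnp.array([-2.0, -1.0]), "b": jnp.array(2.0)}


def step(state, action):
    """Toy differentiable 'simulator': 1-D point mass, state = (position, velocity)."""
    vel = state[1] + DT * action
    pos = state[0] + DT * vel
    return jnp.stack([pos, vel])


def policy(params, state):
    """Deterministic linear policy returning a scalar force (a simplifying assumption)."""
    return jnp.dot(params["w"], state) + params["b"]


def rollout(params, init_state):
    """Unroll policy and simulator together; returns states of shape (HORIZON, 2)."""
    def body(state, _):
        next_state = step(state, policy(params, state))
        return next_state, next_state

    _, states = jax.lax.scan(body, init_state, None, length=HORIZON)
    return states


def trajectory_loss(params, init_state, expert_states):
    """Time-aligned squared distance between agent and expert states."""
    agent_states = rollout(params, init_state)
    return jnp.mean(jnp.sum((agent_states - expert_states) ** 2, axis=-1))


def matched_loss(params, init_state, expert_states):
    """Nearest-state matching in both directions (an assumed variant, see lead-in)."""
    agent_states = rollout(params, init_state)
    diff = agent_states[:, None, :] - expert_states[None, :, :]
    dist = jnp.sum(diff ** 2, axis=-1)  # pairwise squared distances, (HORIZON, HORIZON)
    return jnp.mean(jnp.min(dist, axis=1)) + jnp.mean(jnp.min(dist, axis=0))


def train(loss_fn=trajectory_loss, num_steps=300, lr=0.05):
    """Single optimization loop: no reward model, only gradients through the rollout."""
    init_state = jnp.zeros(2)
    expert_states = rollout(EXPERT_PARAMS, init_state)  # the one "demonstration"
    params = {"w": jnp.zeros(2), "b": jnp.array(0.0)}
    value_and_grad = jax.jit(jax.value_and_grad(loss_fn))

    losses = []
    for _ in range(num_steps):
        loss, grads = value_and_grad(params, init_state, expert_states)
        losses.append(float(loss))
        # Plain gradient descent on the policy parameters.
        params = jax.tree_util.tree_map(lambda p, g: p - lr * g, params, grads)
    return params, losses
```

Calling `params, losses = train()` runs the loop with the time-aligned loss, and `train(loss_fn=matched_loss)` swaps in the nearest-state variant. In both cases the only learned object is the policy, and the gradient reaches it through every `step` call in the rollout. That is the sense in which the simulator sits inside the computational graph.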
