We apply reinforcement learning (RL) to robotics. One of the drawbacks of traditional RL algorithms has been their poor sample efficiency. One approach to improve sample efficiency is model-based RL: we learn a model of the environment, essentially its dynamics and reward function, use it to generate imaginary trajectories, and backpropagate through them to update the policy, exploiting the differentiability of the model. Intuitively, learning more accurate models should lead to better performance. Recently, there has been growing interest in developing better deep neural network-based dynamics models for physical systems through better inductive biases. We focus on robotic systems undergoing rigid-body motion. We compare two versions of our model-based RL algorithm: one uses a standard deep neural network-based dynamics model, and the other uses a much more accurate, physics-informed neural network-based dynamics model. We show that, in environments that are not sensitive to initial conditions, model accuracy matters only to some extent, as numerical errors accumulate slowly; in these environments, both versions achieve similar average return, while the physics-informed version achieves better sample efficiency. We show that, in environments that are sensitive to initial conditions, model accuracy matters a lot, as numerical errors accumulate fast; in these environments, the physics-informed version achieves significantly better average return and sample efficiency. We show that, in challenging environments, where many samples are needed to learn, physics-informed model-based RL can achieve better asymptotic performance than model-free RL by generating accurate imaginary data, which allows it to perform many more policy updates. In these environments, our physics-informed model-based RL approach achieves better average return than Soft Actor-Critic, a state-of-the-art model-free RL algorithm.
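To make the policy-update step concrete, below is a minimal PyTorch sketch of backpropagating an imagined return through a learned, differentiable dynamics and reward model. The module names (MLP, dynamics, reward, policy), the deterministic tanh policy, the rollout horizon, and the optimizer settings are illustrative assumptions, not the exact implementation used in the paper.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Small fully connected network used for the model and policy."""
    def __init__(self, in_dim, out_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

# Illustrative dimensions and horizon (assumptions, not the paper's values).
state_dim, action_dim, horizon = 8, 2, 15
dynamics = MLP(state_dim + action_dim, state_dim)  # learned transition model: (s, a) -> s'
reward = MLP(state_dim + action_dim, 1)            # learned reward model: (s, a) -> r
policy = MLP(state_dim, action_dim)                # deterministic policy: s -> a

policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def imagined_policy_update(start_states, gamma=0.99):
    """Roll the policy through the learned model and backpropagate the
    imagined discounted return into the policy parameters."""
    s = start_states
    total_return = 0.0
    for t in range(horizon):
        a = torch.tanh(policy(s))                  # keep actions bounded
        sa = torch.cat([s, a], dim=-1)
        total_return = total_return + (gamma ** t) * reward(sa).squeeze(-1)
        s = dynamics(sa)                           # differentiable imagined transition
    loss = -total_return.mean()                    # maximize imagined return
    policy_opt.zero_grad()
    loss.backward()                                # gradients flow through the whole rollout
    policy_opt.step()
    return loss.item()

# Usage: start imagined rollouts from (here random) states, e.g. sampled from a replay buffer.
batch = torch.randn(64, state_dim)
imagined_policy_update(batch)
```

In this sketch the physics-informed variant would only swap the `dynamics` network for a model with rigid-body inductive biases; the backpropagation-through-imagination update itself is unchanged.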