Substantial advancements in model-based reinforcement learning algorithms have been impeded by the model bias induced by the collected data, which generally hurts performance. Meanwhile, their inherent sample efficiency makes them well suited for most robot applications, limiting potential damage to the robot and its environment during training. Inspired by information-theoretic model predictive control and advances in deep reinforcement learning, we introduce Model Predictive Actor-Critic (MoPAC), a hybrid model-based/model-free method that combines model predictive rollouts with policy optimization to mitigate model bias. MoPAC leverages optimal trajectories to guide policy learning, but explores via its model-free method, allowing the algorithm to learn more expressive dynamics models. This combination guarantees optimal skill learning up to an approximation error and reduces necessary physical interaction with the environment, making it suitable for real-robot training. We provide extensive results showcasing how our proposed method generally outperforms current state-of-the-art methods and conclude by evaluating MoPAC for learning on a physical robotic hand performing valve rotation and finger gaiting, a task that requires grasping, manipulation, and then regrasping of an object.
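To make the hybrid structure concrete, the following is a minimal illustrative sketch (not the authors' implementation): a learned dynamics model drives short model-predictive rollouts whose actions guide policy updates, while exploration still comes from the model-free policy. The toy point-mass environment, linear dynamics model, random-shooting planner, and simplified policy update are all assumptions introduced for illustration.

```python
# Illustrative sketch (not the authors' code): model-predictive rollouts from a
# learned model guide policy learning, while exploration is model-free.
import numpy as np

rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy 1-D point-mass: state = [position, velocity], action = force."""
    pos, vel = s
    vel = vel + 0.1 * a
    pos = pos + 0.1 * vel
    return np.array([pos, vel]), -(pos ** 2)   # reward: stay near the origin

def fit_model(S, A, S_next):
    """Learned dynamics model: linear regression on (s, a) -> s'."""
    X = np.hstack([S, A, np.ones((len(S), 1))])
    W, *_ = np.linalg.lstsq(X, S_next, rcond=None)
    return W

def model_step(W, s, a):
    return np.hstack([s, a, 1.0]) @ W

def mpc_action(W, s, horizon=5, samples=64):
    """Random-shooting model predictive rollout under the learned model
    (a stand-in for the information-theoretic MPC referenced in the abstract)."""
    seqs = rng.uniform(-1, 1, size=(samples, horizon, 1))
    returns = np.zeros(samples)
    for i, seq in enumerate(seqs):
        s_sim = s.copy()
        for a in seq:
            s_sim = model_step(W, s_sim, a)
            returns[i] += -(s_sim[0] ** 2)
    return seqs[np.argmax(returns), 0]

# Hybrid loop: model-free exploration collects real data, the dynamics model is
# refit on that data, and MPC rollouts supply guidance for the policy update
# (here stubbed as nudging a linear policy toward the planned action, in place
# of a full actor-critic update on model-generated rollouts).
policy_gain = np.zeros(2)                       # linear policy: a = gain @ s
S, A, S_next = [], [], []
s = np.array([1.0, 0.0])
for step in range(200):
    a = float(policy_gain @ s + 0.3 * rng.standard_normal())   # exploratory action
    s_new, _ = env_step(s, a)
    S.append(s); A.append([a]); S_next.append(s_new)
    if step > 20 and step % 10 == 0:
        W = fit_model(np.array(S), np.array(A), np.array(S_next))
        a_mpc = mpc_action(W, s)
        policy_gain += 0.05 * (a_mpc - policy_gain @ s) * s
    s = s_new
```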