Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best of both: the asymptotic performance of Mf-RL and the high sample efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for Mb-trajectory optimization with off-policy methods for Mf-RL. In particular, two loops are proposed, where Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner-loop Mb-RL to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer-loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. We finally introduce a new algorithm, Mirror-Descent Model Predictive RL (M-DeMoRL), which uses the Cross-Entropy Method (CEM) with elite fractions for the inner loop. Our experiments show faster convergence of the proposed hierarchical approach on benchmark MuJoCo tasks. We also demonstrate hardware training for trajectory tracking on a 2R leg and hardware transfer for robust walking on a quadruped. We show that the inner-loop Mb-RL significantly decreases the number of training iterations required on the real system, thereby validating the proposed approach.
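To make the two-loop structure concrete, the following is a minimal sketch, not the authors' implementation: an inner CEM-with-elite-fractions planner (standing in for the DMD-MPC inner loop) produces actions from a model, and those actions populate the replay buffer that an outer off-policy Mf-RL agent would train on. The toy dynamics, reward, buffer, and all dimensions below are illustrative assumptions; the off-policy agent update itself is omitted.

```python
# Hedged sketch of the hierarchical Mb-Mf structure described above.
# dynamics(), reward(), and the replay buffer are placeholders, not the paper's models.
import numpy as np

ACTION_DIM, HORIZON = 2, 15

def dynamics(state, action):
    """Toy linear dynamics standing in for the learned/analytic model."""
    return 0.95 * state + 0.1 * action

def reward(state, action):
    """Placeholder objective: drive the state to the origin with small actions."""
    return -np.sum(state**2) - 0.01 * np.sum(action**2)

def cem_mpc(state, n_samples=64, n_elite=8, n_iters=5):
    """Inner loop: CEM with elite fractions over an action sequence (Mb planning)."""
    mean = np.zeros((HORIZON, ACTION_DIM))
    std = np.ones((HORIZON, ACTION_DIM))
    for _ in range(n_iters):
        seqs = mean + std * np.random.randn(n_samples, HORIZON, ACTION_DIM)
        returns = np.empty(n_samples)
        for i, seq in enumerate(seqs):
            s, ret = state.copy(), 0.0
            for a in seq:
                ret += reward(s, a)
                s = dynamics(s, a)
            returns[i] = ret
        elites = seqs[np.argsort(returns)[-n_elite:]]     # keep the elite fraction
        mean, std = elites.mean(axis=0), elites.std(axis=0) + 1e-6
    return mean[0]  # receding horizon: execute only the first action

# Outer loop: MPC actions fill the replay buffer an off-policy Mf-RL agent
# (e.g., SAC/TD3) would learn from; the agent's gradient update is omitted here.
replay_buffer = []
state = np.array([1.0, -0.5])
for step in range(100):
    action = cem_mpc(state)               # Mb inner loop proposes the action
    next_state = dynamics(state, action)  # environment step (toy stand-in)
    replay_buffer.append((state, action, reward(state, action), next_state))
    state = next_state
```

Under these assumptions, the inner loop supplies informative transitions early in training, which is the mechanism by which the outer off-policy learner needs fewer interactions on the real system.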