Recent works in Reinforcement Learning (RL) combine model-free (Mf)-RL algorithms with model-based (Mb)-RL approaches to get the best of both worlds: the asymptotic performance of Mf-RL and the high sample efficiency of Mb-RL. Inspired by these works, we propose a hierarchical framework that integrates online learning for Mb-trajectory optimization with off-policy methods for Mf-RL. In particular, two loops are proposed, where Dynamic Mirror Descent based Model Predictive Control (DMD-MPC) is used as the inner loop to obtain an optimal sequence of actions. These actions are in turn used to significantly accelerate the outer-loop Mf-RL. We show that our formulation is generic for a broad class of MPC-based policies and objectives, and includes some of the well-known Mb-Mf approaches. Based on this framework, we define two algorithms: one to increase the sample efficiency of off-policy RL and one to guide end-to-end RL algorithms for online adaptation. Concretely, we introduce two novel algorithms: Dynamic-Mirror Descent Model Predictive RL (DeMoRL), which uses the method of elite fractions for the inner loop and Soft Actor-Critic (SAC) as the off-policy RL for the outer loop, and the Dynamic-Mirror Descent Model Predictive Layer (DeMo Layer), a special case of the hierarchical framework that guides linear policies trained using Augmented Random Search (ARS). Our experiments show faster convergence of the proposed DeMoRL, and better or equal performance compared to other Mb-Mf approaches on benchmark MuJoCo control tasks. The DeMo Layer was tested on the classical CartPole benchmark and a custom-built quadruped trained using a linear policy.
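To make the two-loop structure concrete, the following is a minimal sketch, not the paper's implementation: the inner loop is a CEM-style elite-fraction update (one member of the DMD-MPC family), and the outer-loop off-policy learner (SAC in DeMoRL) is replaced by a stub. The names `step_model`, `cost`, and `off_policy_update`, and the toy linear dynamics, are assumptions introduced only for illustration.

```python
# Minimal sketch of the hierarchical Mb-Mf loop described in the abstract.
# Assumptions (not from the paper): toy linear dynamics as the model,
# a quadratic cost, and a stub standing in for the SAC outer loop.
import numpy as np

rng = np.random.default_rng(0)

HORIZON, N_SAMPLES, ELITE_FRAC = 15, 64, 0.1
ACT_DIM = 1


def step_model(obs, act):
    """Toy linear dynamics used as the (learned) model in this sketch."""
    A = np.array([[1.0, 0.1], [0.0, 1.0]])
    B = np.array([[0.0], [0.1]])
    return A @ obs + B @ act


def cost(obs, act):
    """Quadratic state/action cost (placeholder for the task objective)."""
    return float(obs @ obs + 0.01 * act @ act)


def dmd_mpc_inner_loop(obs, mean):
    """One inner-loop update: sample action sequences, keep the elite
    fraction, and move the sampling mean toward the elites (a CEM-style
    special case of the DMD-MPC family)."""
    samples = mean + 0.3 * rng.standard_normal((N_SAMPLES, HORIZON, ACT_DIM))
    returns = np.empty(N_SAMPLES)
    for i, seq in enumerate(samples):
        o, total = obs.copy(), 0.0
        for a in seq:
            total += cost(o, a)
            o = step_model(o, a)
        returns[i] = total
    elite = samples[np.argsort(returns)[: int(ELITE_FRAC * N_SAMPLES)]]
    return elite.mean(axis=0)  # updated plan; its first action is executed


def off_policy_update(replay_buffer):
    """Stub for the outer-loop off-policy learner (SAC in DeMoRL)."""
    pass


# Outer loop: MPC-generated transitions feed the Mf learner's replay buffer.
obs = np.array([1.0, 0.0])
mean = np.zeros((HORIZON, ACT_DIM))
replay_buffer = []
for t in range(100):
    mean = dmd_mpc_inner_loop(obs, mean)
    act = mean[0]
    next_obs = step_model(obs, act)
    replay_buffer.append((obs, act, -cost(obs, act), next_obs))
    off_policy_update(replay_buffer)
    obs, mean = next_obs, np.roll(mean, -1, axis=0)  # warm-start next plan
```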