Current state-of-the-art model-based reinforcement learning algorithms use trajectory sampling methods, such as the Cross-Entropy Method (CEM), for planning in continuous control settings. These zeroth-order optimizers require sampling a large number of trajectory rollouts to select an optimal action, which scales poorly for long prediction horizons or high-dimensional action spaces. First-order methods that use the gradients of the rewards with respect to the actions as an update can mitigate this issue, but suffer from local optima due to the non-convex optimization landscape. To overcome these issues and achieve the best of both worlds, we propose a novel planner, Cross-Entropy Method with Gradient Descent (CEM-GD), that combines first-order methods with CEM. At the beginning of execution, CEM-GD uses CEM to sample a large number of trajectory rollouts to explore the optimization landscape and avoid poor local minima. It then uses the top trajectories as initialization for gradient descent and applies gradient updates to each of these trajectories to find the optimal action sequence. At each subsequent time step, however, CEM-GD samples far fewer trajectories from CEM before applying gradient updates. We show that as the dimensionality of the planning problem increases, CEM-GD maintains desirable performance with a constant, small number of samples by using the gradient information, while avoiding local optima using the initially well-sampled trajectories. Furthermore, CEM-GD achieves better performance than CEM on a variety of continuous control benchmarks in MuJoCo with 100x fewer samples per time step, resulting in around 25% less computation time and 10% less memory usage. The implementation of CEM-GD is available at $\href{https://github.com/KevinHuang8/CEM-GD}{\text{https://github.com/KevinHuang8/CEM-GD}}$.
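To make the two-phase structure concrete, the following is a minimal PyTorch sketch of one planning step: a CEM phase that samples rollouts and refits a sampling distribution to the elites, followed by a gradient-descent phase that refines the elite action sequences by backpropagating returns through the model. All function names (`dynamics`, `reward`, `rollout_return`) and hyperparameter values are illustrative assumptions, not the paper's actual implementation (see the linked repository for that); the sketch assumes a differentiable learned dynamics model and reward function. It also omits the receding-horizon details described above, such as warm-starting subsequent time steps with far fewer CEM samples.

```python
import torch


def rollout_return(dynamics, reward, state, actions):
    """Accumulate reward along one trajectory under the learned model.

    Assumes `dynamics(state, action) -> next_state` and
    `reward(state, action) -> scalar tensor` are differentiable,
    so the return can be backpropagated to `actions`.
    """
    total = 0.0
    for a in actions:
        total = total + reward(state, a)
        state = dynamics(state, a)
    return total


def cem_gd_plan(dynamics, reward, state, horizon, action_dim,
                n_samples=500, n_elites=10, cem_iters=5,
                gd_steps=10, lr=1e-2):
    """Illustrative CEM-GD planning step (hyperparameters are placeholders)."""
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)

    # Phase 1: CEM. Sample action sequences, keep the top-scoring
    # elites, and refit the Gaussian sampling distribution to them.
    for _ in range(cem_iters):
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        returns = torch.stack([rollout_return(dynamics, reward, state, a)
                               for a in actions])
        elites = actions[returns.topk(n_elites).indices]
        mean, std = elites.mean(0), elites.std(0)

    # Phase 2: gradient descent. Use the elites as initializations and
    # apply first-order updates that maximize the predicted return.
    elites = elites.clone().requires_grad_(True)
    opt = torch.optim.Adam([elites], lr=lr)
    for _ in range(gd_steps):
        opt.zero_grad()
        loss = -torch.stack([rollout_return(dynamics, reward, state, a)
                             for a in elites]).sum()
        loss.backward()
        opt.step()

    # Return the refined action sequence with the highest predicted return.
    with torch.no_grad():
        best = torch.stack([rollout_return(dynamics, reward, state, a)
                            for a in elites]).argmax()
    return elites[best].detach()
```

Refining each elite independently, rather than only the CEM mean, reflects the idea in the abstract that well-sampled initializations let the first-order updates escape the local optima a single gradient-descent run would fall into.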