The applicability of reinforcement learning (RL) algorithms in real-world domains often requires adherence to safety constraints, a need that is difficult to address given the asymptotic nature of the classic RL optimization objective. In contrast to the traditional RL objective, safe exploration maximizes expected returns under safety constraints expressed in terms of expected cost returns. We introduce a model-based safe exploration algorithm for constrained high-dimensional control that addresses the often prohibitively high sample complexity of model-free safe exploration algorithms. Further, we provide theoretical and empirical analyses of the implications of model usage for constrained policy optimization problems and introduce a practical algorithm that accelerates policy search with model-generated data. The need for accurate estimates of a policy's constraint satisfaction conflicts with accumulating model errors. We address this issue by quantifying model uncertainty as the expected Kullback-Leibler divergence between the predictions of an ensemble of probabilistic dynamics models and constraining this error measure, which yields an adaptive resampling scheme and dynamically limited rollout horizons. We evaluate this approach on several simulated constrained robot locomotion tasks with high-dimensional action and state spaces. Our empirical studies find that our algorithm matches model-free performance with a 10-20 fold reduction in training samples while maintaining the approximate constraint satisfaction levels of model-free methods.
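To make the uncertainty-limited rollout mechanism concrete, the following is a minimal Python sketch of one plausible reading of the abstract: ensemble disagreement is measured as the average pairwise KL divergence between the diagonal-Gaussian next-state predictions of the ensemble members, and a model rollout is truncated once this disagreement exceeds a threshold. The helper names (`env_model_ensemble.predict`, `policy`, `kl_threshold`, `max_horizon`) and the pairwise-averaging choice are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def kl_diag_gaussians(mu_p, var_p, mu_q, var_q):
    """KL( N(mu_p, var_p) || N(mu_q, var_q) ) for diagonal Gaussians, summed over state dimensions."""
    return 0.5 * np.sum(np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0, axis=-1)

def expected_ensemble_kl(means, variances):
    """Average pairwise KL divergence between the predictive distributions of an ensemble
    of probabilistic dynamics models. means, variances: arrays of shape (n_models, state_dim)."""
    n = means.shape[0]
    kls = [kl_diag_gaussians(means[i], variances[i], means[j], variances[j])
           for i in range(n) for j in range(n) if i != j]
    return float(np.mean(kls))

def rollout_with_uncertainty_cap(env_model_ensemble, policy, start_state,
                                 max_horizon=20, kl_threshold=0.05):
    """Generate model-based transitions until either the horizon is reached or
    the ensemble disagreement (expected KL) exceeds the threshold (hypothetical interface)."""
    state, transitions = start_state, []
    for _ in range(max_horizon):
        action = policy(state)
        # Each ensemble member predicts a Gaussian over the next state: (n_models, state_dim) each.
        means, variances = env_model_ensemble.predict(state, action)
        if expected_ensemble_kl(means, variances) > kl_threshold:
            break  # truncate the rollout where model uncertainty is too high
        # Sample the next state from a randomly chosen ensemble member.
        k = np.random.randint(means.shape[0])
        next_state = np.random.normal(means[k], np.sqrt(variances[k]))
        transitions.append((state, action, next_state))
        state = next_state
    return transitions
```

In this reading, the `kl_threshold` plays the role of the constraint on the error measure: rollouts remain long where the ensemble members agree and are cut short where accumulated model error would bias the constraint-satisfaction estimates.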