Safe exploration can be regarded as a constrained Markov decision process in which the expected long-term cost is constrained. Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem via Lagrangian relaxation. However, the cost functions of these algorithms provide inaccurate estimates, which destabilizes the learning of the Lagrange multiplier. In this paper, we present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO). First, to judge accurately whether the current situation satisfies the constraints, CDMPO adopts a distributional reinforcement learning method to estimate the Q-function and the C-function. Second, CDMPO uses a conservative value-function loss to reduce the number of constraint violations during exploration. In addition, we use a Weighted Average Proportional Integral Derivative (WAPID) controller to update the Lagrange multiplier stably. Empirical results show that the proposed method incurs fewer constraint violations during early exploration, and the final test results also illustrate that our method achieves better risk control.
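For orientation, the constrained objective and the Lagrangian relaxation referred to above are conventionally written as follows (this is the standard formulation, not a quotation from the paper):

\[
\max_{\pi} \; J_R(\pi) \quad \text{s.t.} \quad J_C(\pi) \le d
\qquad \Longrightarrow \qquad
\min_{\lambda \ge 0} \, \max_{\pi} \; L(\pi, \lambda) = J_R(\pi) - \lambda \bigl( J_C(\pi) - d \bigr),
\]

where \(J_R(\pi)\) and \(J_C(\pi)\) denote the expected discounted return and the expected long-term cost of policy \(\pi\), and \(d\) is the cost limit.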
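The WAPID update for the Lagrange multiplier is not detailed in the abstract; as a minimal sketch, assuming a generic PID-style update driven by the measured constraint violation (the class name and gains below are illustrative, and WAPID's weighted-average smoothing is omitted), it could look like:

# Sketch of a PID-style Lagrange multiplier update for a constrained MDP.
# The paper's WAPID additionally applies a weighted-average smoothing to the
# constraint signal; its exact form is not given here, so this is only a
# generic PID Lagrangian update. All names and gain values are illustrative.

class PIDLagrangeMultiplier:
    def __init__(self, cost_limit, kp=0.05, ki=0.01, kd=0.01):
        self.cost_limit = cost_limit    # constraint threshold d on the expected cost
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0             # accumulated constraint violation
        self.prev_error = 0.0           # previous violation, for the derivative term

    def update(self, episode_cost):
        """Return a new multiplier lambda >= 0 from the latest measured cost J_C."""
        error = episode_cost - self.cost_limit           # positive when the constraint is violated
        self.integral = max(0.0, self.integral + error)  # keep the integral term non-negative
        derivative = error - self.prev_error
        self.prev_error = error
        lam = self.kp * error + self.ki * self.integral + self.kd * derivative
        return max(0.0, lam)                             # project onto lambda >= 0


# Example usage with a fabricated cost measurement:
pid = PIDLagrangeMultiplier(cost_limit=25.0)
lam = pid.update(episode_cost=30.0)   # constraint violated, so the multiplier increases
print(lam)

Raising the multiplier whenever the measured cost exceeds the limit, and projecting it back onto \(\lambda \ge 0\), is what couples the cost signal to the unconstrained dual objective above.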