Conservative Distributional Reinforcement Learning with Safety Constraints

Safe exploration can be regarded as a constrained Markov decision problem in which the expected long-term cost is constrained. Previous off-policy algorithms convert the constrained optimization problem into the corresponding unconstrained dual problem by introducing the Lagrangian relaxation technique. However, the cost function of these algorithms provides inaccurate estimates and destabilizes the learning of the Lagrange multiplier. In this paper, we present a novel off-policy reinforcement learning algorithm called Conservative Distributional Maximum a Posteriori Policy Optimization (CDMPO). First, to accurately judge whether the current situation satisfies the constraints, CDMPO adapts a distributional reinforcement learning method to estimate the Q-function and C-function. Then, CDMPO uses a conservative value function loss to reduce the number of constraint violations during exploration. In addition, we use Weighted Average Proportional Integral Derivative (WAPID) to update the Lagrange multiplier stably. Empirical results show that the proposed method incurs fewer constraint violations in the early exploration process. The final test results also show that our method has better risk control.
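The abstract does not spell out how WAPID weights its terms, so the following is only a minimal sketch of a generic PID-style Lagrange multiplier update, not the paper's WAPID. The class name `PIDLagrangeMultiplier`, the field `cost_limit`, and all gain values are hypothetical and chosen purely for illustration. The idea is that the multiplier grows while the estimated cost exceeds the limit and shrinks back toward zero once the constraint is satisfied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PIDLagrangeMultiplier:
    """Generic PID controller on the constraint violation.

    Illustrative only: this is NOT the paper's WAPID update, whose
    weighting scheme is not described in the abstract.
    """
    cost_limit: float                  # constraint threshold (hypothetical name)
    kp: float = 0.1                    # proportional gain (hypothetical value)
    ki: float = 0.01                   # integral gain (hypothetical value)
    kd: float = 0.0                    # derivative gain (hypothetical value)
    integral: float = 0.0              # accumulated violation, clipped at zero
    prev_cost: Optional[float] = None  # last cost estimate, for the derivative term

    def update(self, cost_estimate: float) -> float:
        """Return the new non-negative multiplier for the given cost estimate."""
        error = cost_estimate - self.cost_limit
        # Integral term: accumulate the violation, never below zero.
        self.integral = max(0.0, self.integral + error)
        # Derivative term: react only to increasing cost; zero on the first call.
        if self.prev_cost is None:
            derivative = 0.0
        else:
            derivative = max(0.0, cost_estimate - self.prev_cost)
        self.prev_cost = cost_estimate
        # The multiplier must stay non-negative for an inequality constraint.
        return max(0.0, self.kp * error + self.ki * self.integral + self.kd * derivative)


controller = PIDLagrangeMultiplier(cost_limit=25.0)
lam_violating = controller.update(35.0)   # cost above the limit: multiplier is positive
lam_satisfied = controller.update(15.0)   # cost below the limit: multiplier returns to zero
```

In a Lagrangian-relaxed objective, the returned multiplier would scale the cost term that is subtracted from the reward term; how CDMPO combines the two is not stated in the abstract.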


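Related content

Lagrange multiplier

In mathematical optimization, the method of Lagrange multipliers is a strategy for finding the local maxima and minima of a function subject to equality constraints (that is, subject to the condition that one or more equations must be satisfied exactly by the chosen values of the variables). It is named after the mathematician Joseph-Louis Lagrange. The basic idea is to convert a constrained problem into a form in which the derivative test of an unconstrained problem can still be applied. The relationship between the gradient of the function and the gradients of the constraints leads naturally to a reformulation of the original problem, known as the Lagrangian function.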
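As a small worked illustration of the method described above, consider a toy problem (chosen here for illustration only, not taken from the paper): maximize $f(x, y) = xy$ subject to the equality constraint $x + y = 1$. Setting the gradient of the Lagrangian to zero recovers both the stationary point and the multiplier.

```latex
\begin{aligned}
\mathcal{L}(x, y, \lambda) &= xy - \lambda\,(x + y - 1) \\
\frac{\partial \mathcal{L}}{\partial x} &= y - \lambda = 0 \\
\frac{\partial \mathcal{L}}{\partial y} &= x - \lambda = 0 \\
\frac{\partial \mathcal{L}}{\partial \lambda} &= -(x + y - 1) = 0 \\
\Rightarrow\quad x = y = \lambda &= \tfrac{1}{2}, \qquad f\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right) = \tfrac{1}{4}
\end{aligned}
```

On the constraint, $f = x(1 - x)$ is concave in $x$, so this stationary point is the constrained maximum.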
