Though deep reinforcement learning (DRL) has achieved substantial success, it may encounter catastrophic failures due to intrinsic uncertainty in both transitions and observations. Most existing methods for safe reinforcement learning can handle only transition disturbance or observation disturbance, since the two kinds of disturbance affect different parts of the agent; moreover, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric, the Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on this analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm, CVaR-Proximal-Policy-Optimization (CPPO), which formalizes a risk-sensitive constrained optimization problem that keeps the CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
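For concreteness, the CVaR constraint can be sketched in the standard Rockafellar–Uryasev form below; the specific risk variable, confidence level, and threshold symbols are illustrative assumptions rather than the paper's own notation (here $X$ denotes the loss, i.e., the negative return, $\alpha \in (0,1)$ the tail probability, and $d$ the given threshold):
$$\max_{\pi}\ \mathbb{E}_{\tau\sim\pi}\!\left[R(\tau)\right] \quad \text{s.t.} \quad \mathrm{CVaR}_{\alpha}\!\left(-R(\tau)\right)\le d, \qquad \text{where}\quad \mathrm{CVaR}_{\alpha}(X)=\min_{\nu\in\mathbb{R}}\left\{\nu+\tfrac{1}{\alpha}\,\mathbb{E}\!\left[(X-\nu)_{+}\right]\right\}.$$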