Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in a variety of sequential decision-making problems. While policy gradient methods have been developed for risk-sensitive RL, it remains unclear whether these methods enjoy the same global convergence guarantees as in the risk-neutral case. In this paper, we consider a class of dynamic time-consistent risk measures, called Expected Conditional Risk Measures (ECRMs), and derive policy gradient updates for ECRM-based objective functions. Under both constrained direct parameterization and unconstrained softmax parameterization, we prove global convergence of the corresponding risk-averse policy gradient algorithms. We further test a risk-averse variant of the REINFORCE algorithm on a stochastic Cliffwalk environment to demonstrate the efficacy of our algorithm and the importance of risk control.
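For context, ECRMs are dynamic, time-consistent risk measures built from one-step conditional risk mappings applied stage by stage. A common per-stage choice in this literature is the conditional value-at-risk (CVaR); the following is only an illustrative sketch under that assumption, not necessarily the exact instantiation used in the paper. For a random cost $Z$ and tail-probability level $\alpha \in (0,1]$ (i.e., averaging over the worst $\alpha$-fraction of outcomes), CVaR admits the Rockafellar--Uryasev variational form
\[
\mathrm{CVaR}_{\alpha}(Z) \;=\; \min_{s \in \mathbb{R}} \Big\{ s + \tfrac{1}{\alpha}\, \mathbb{E}\big[(Z - s)_{+}\big] \Big\}, \qquad (x)_{+} = \max\{x, 0\}.
\]
This variational form is the standard route to policy gradient updates for CVaR-based risk-averse objectives: the risk-averse problem becomes a joint minimization over the policy parameters and the auxiliary scalars $s$, so score-function (REINFORCE-style) estimators can be applied to the augmented objective.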