Risk-sensitive reinforcement learning (RL) has become a popular tool for controlling the risk of uncertain outcomes and ensuring reliable performance in highly stochastic sequential decision-making problems. While Policy Gradient (PG) methods have been developed for risk-sensitive RL, it remains unclear whether these methods enjoy the same global convergence guarantees as in the risk-neutral case \citep{mei2020global,agarwal2021theory,cen2022fast,bhandari2024global}. In this paper, we consider a class of dynamic time-consistent risk measures, named Expected Conditional Risk Measures (ECRMs), and derive PG and Natural Policy Gradient (NPG) updates for ECRM-based RL problems. We establish global optimality guarantees and iteration complexities for the proposed algorithms under the following four settings: (i) PG with constrained direct parameterization, (ii) PG with softmax parameterization and log barrier regularization, (iii) NPG with softmax parameterization and entropy regularization, and (iv) approximate NPG with inexact policy evaluation. Furthermore, we test a risk-averse REINFORCE algorithm \citep{williams1992simple} and a risk-averse NPG algorithm \citep{kakade2001natural} on a stochastic Cliffwalk environment to demonstrate the efficacy of our methods and the importance of risk control.