We propose policy-gradient algorithms for solving the control problem in a risk-sensitive reinforcement learning (RL) context. The objective of our algorithms is to maximize the distortion risk measure (DRM) of the cumulative reward in an episodic Markov decision process. We derive a variant of the policy gradient theorem that caters to the DRM objective. Using this theorem in conjunction with a likelihood ratio-based gradient estimation scheme, we propose policy gradient algorithms for optimizing DRM in both on-policy and off-policy RL settings. We derive non-asymptotic bounds that establish the convergence of our algorithms to an approximate stationary point of the DRM objective.
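As background, and not as the paper's exact statements, the quantities named in the abstract can be recalled as follows: the first display is the standard definition of the DRM of a reward variable R, assuming a distortion function g and complementary distribution function \bar{F}_R (notation introduced here for illustration); the second is the generic likelihood-ratio (score-function) identity that such gradient estimation schemes typically build on, with \tau denoting a trajectory and p_\theta its density under the policy, again an illustrative sketch rather than the paper's derived gradient formula.

% Background sketch, not the paper's theorem: DRM of a reward variable R,
% with \bar{F}_R(t) = P(R > t) and a nondecreasing distortion function
% g : [0,1] -> [0,1] satisfying g(0) = 0 and g(1) = 1.
\[
  \rho_g(R) \;=\; \int_{-\infty}^{0} \bigl( g\bigl(\bar{F}_R(t)\bigr) - 1 \bigr)\,\mathrm{d}t
  \;+\; \int_{0}^{\infty} g\bigl(\bar{F}_R(t)\bigr)\,\mathrm{d}t .
\]
% Generic likelihood-ratio (score-function) identity for a trajectory \tau
% with density p_\theta induced by the policy \pi_\theta, for any
% integrable function f of the cumulative reward R(\tau).
\[
  \nabla_\theta\, \mathbb{E}_{\tau \sim p_\theta}\!\bigl[ f\bigl(R(\tau)\bigr) \bigr]
  \;=\; \mathbb{E}_{\tau \sim p_\theta}\!\bigl[ f\bigl(R(\tau)\bigr)\, \nabla_\theta \log p_\theta(\tau) \bigr].
\]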