We propose policy-gradient algorithms for solving the problem of control in a risk-sensitive reinforcement learning (RL) setting. Our algorithms aim to maximize the distorted risk measure (DRM) of the cumulative reward in an episodic Markov decision process (MDP). We derive a variant of the policy gradient theorem that caters to the DRM objective. Using this theorem in conjunction with a likelihood ratio (LR) based gradient estimation scheme, we propose policy gradient algorithms for optimizing the DRM in both on-policy and off-policy RL settings. We derive non-asymptotic bounds that establish the convergence of our algorithms to an approximate stationary point of the DRM objective.
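For reference, a standard Choquet-integral form of the DRM of a cumulative reward $X$, under a distortion function $g:[0,1]\to[0,1]$ that is non-decreasing with $g(0)=0$ and $g(1)=1$, is sketched below; the paper's exact formulation and sign conventions may differ.

\[
\rho_g(X) \;=\; \int_{0}^{\infty} g\big(\mathbb{P}(X > t)\big)\,dt \;-\; \int_{-\infty}^{0} \Big[1 - g\big(\mathbb{P}(X > t)\big)\Big]\,dt .
\]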