Keeping risk under control is often more crucial than maximizing expected reward in real-world decision-making, e.g., in finance, robotics, and autonomous driving. The most natural risk measure is variance, yet it penalizes upside volatility as much as the downside. Instead, the (downside) semivariance, which captures the negative deviation of a random variable below its mean, is more suitable for risk-averse purposes. This paper aims at optimizing the mean-semivariance (MSV) criterion in reinforcement learning w.r.t. steady rewards. Since the semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods are not directly applicable to MSV problems. To tackle this challenge, we resort to Perturbation Analysis (PA) theory and establish a performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we conduct diverse experiments, ranging from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of the proposed methods.
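To make the variance-versus-semivariance distinction concrete, the following is a minimal, illustrative Python sketch (not taken from the paper): it estimates both quantities from a batch of sampled rewards, where the helper names and example arrays are ours and only serve to show that semivariance penalizes deviations below the mean.

```python
import numpy as np

def variance(x):
    """Sample variance: penalizes deviations on both sides of the mean."""
    x = np.asarray(x, dtype=float)
    return np.mean((x - x.mean()) ** 2)

def semivariance(x):
    """Downside semivariance: penalizes only deviations below the mean."""
    x = np.asarray(x, dtype=float)
    downside = np.minimum(x - x.mean(), 0.0)  # keep only negative deviations
    return np.mean(downside ** 2)

# Two reward samples with the same mean but different downside behavior;
# the semivariance separates them more sharply than the variance does.
rewards_symmetric = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
rewards_skewed = np.array([-3.0, 0.5, 0.5, 1.0, 1.0])
print(variance(rewards_symmetric), semivariance(rewards_symmetric))  # 2.0, 1.0
print(variance(rewards_skewed), semivariance(rewards_skewed))        # 2.3, 1.8
```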