Prior work on safe Reinforcement Learning (RL) has studied risk aversion to randomness in the dynamics (aleatory uncertainty) and to model uncertainty (epistemic uncertainty) in isolation. We propose and analyze a new framework for jointly modeling the risk associated with epistemic and aleatory uncertainties in finite-horizon and discounted infinite-horizon MDPs. We call this framework, which combines Risk-Averse and Soft-Robust methods, RASR. We show that when risk aversion is defined using either EVaR or the entropic risk measure, the optimal policy in RASR can be computed efficiently using a new dynamic programming formulation with a time-dependent risk level. As a consequence, the optimal risk-averse policies are deterministic but time-dependent, even in the infinite-horizon discounted setting. We also show that particular RASR objectives reduce to risk-averse RL with mean posterior transition probabilities. Our empirical results show that our new algorithms consistently mitigate uncertainty as measured by EVaR and other standard risk measures.
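For reference, the two risk measures named above admit the following standard definitions, stated here in the reward-maximization convention; the exact conventions and risk-level parameterization used inside RASR's dynamic program are not specified in this abstract, so the display below is a reminder of the textbook forms rather than the paper's formulation. For a bounded random reward $X$, risk level $\beta > 0$, and confidence level $\alpha \in (0, 1]$,

$$\operatorname{ERM}_{\beta}[X] \;=\; -\tfrac{1}{\beta}\,\log \mathbb{E}\!\left[e^{-\beta X}\right], \qquad \operatorname{EVaR}_{\alpha}[X] \;=\; \sup_{\beta > 0}\Bigl\{\operatorname{ERM}_{\beta}[X] + \tfrac{\log \alpha}{\beta}\Bigr\}.$$

Both quantities are lower bounds on $\mathbb{E}[X]$ and recover it in the limits $\beta \to 0$ and $\alpha \to 1$, respectively, so smaller $\beta$ or larger $\alpha$ corresponds to weaker risk aversion.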