In this paper, we consider risk-sensitive sequential decision-making in Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works consider either aleatory or epistemic risk individually, or as an additive combination of the two. We prove that the additive formulation is a special case of composite risk in which the epistemic risk measure is replaced with the expectation; composite risk is therefore more sensitive to both aleatory and epistemic uncertainty than the individual and additive formulations. We also propose an algorithm, SENTINEL-K, which uses ensemble bootstrapping and distributional RL to represent epistemic and aleatory uncertainty, respectively. The ensemble of K learners uses Follow The Regularised Leader (FTRL) to aggregate the return distributions and obtain the composite risk estimate. We experimentally verify that SENTINEL-K estimates the return distribution more accurately and, when used with composite risk estimates, achieves higher risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
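As a hedged sketch of the special-case claim above (using illustrative notation that the abstract itself does not fix): write $Z$ for the random return, $\theta$ for the model parameters capturing epistemic uncertainty, $\rho_{\mathrm{al}}$ for an aleatory risk measure, and $\rho_{\mathrm{ep}}$ for an epistemic risk measure. Composite risk nests the two measures,

$$\rho_{\mathrm{comp}}[Z] \;=\; \rho_{\mathrm{ep}}^{\theta}\!\left[\,\rho_{\mathrm{al}}\!\left[\,Z \mid \theta\,\right]\right],$$

so substituting the expectation $\mathbb{E}_{\theta}$ for $\rho_{\mathrm{ep}}$ collapses it to $\mathbb{E}_{\theta}\!\left[\rho_{\mathrm{al}}[Z \mid \theta]\right]$, recovering the additive-style treatment of epistemic uncertainty as a special case.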
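For concreteness, a minimal sketch of FTRL-style aggregation over an ensemble's return-distribution estimates, assuming an entropic regulariser (which yields exponential weights over cumulative losses) and a quantile representation of each learner's return distribution; the function name, the loss signal, and the simple weighted combination of quantiles are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def ftrl_weights(cum_losses, eta=0.1):
        # Entropic-regulariser FTRL reduces to exponential weights over
        # the cumulative losses; eta is a learning rate (an assumption).
        logits = -eta * np.asarray(cum_losses, dtype=float)
        logits -= logits.max()  # shift for numerical stability
        w = np.exp(logits)
        return w / w.sum()

    # Hypothetical ensemble of K = 5 learners, each emitting 51 quantile
    # estimates of the return distribution (placeholder values).
    rng = np.random.default_rng(0)
    K, n_quantiles = 5, 51
    quantiles = rng.normal(size=(K, n_quantiles))
    cum_losses = rng.random(K)  # e.g. cumulative distributional TD losses
    w = ftrl_weights(cum_losses)
    aggregated = w @ quantiles  # weighted combination of per-learner quantiles

A risk measure such as CVaR can then be read off the aggregated quantile estimates; whether to mix quantile values directly or mix the underlying distributions is a design choice this sketch does not settle.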