In this paper, we consider risk-sensitive sequential decision-making in model-based Reinforcement Learning (RL). Our contributions are two-fold. First, we introduce a novel and coherent quantification of risk, namely composite risk, which quantifies the joint effect of aleatory and epistemic risk during the learning process. Existing works consider either aleatory or epistemic risk individually, or an additive combination of the two. We prove that the additive formulation is a particular case of composite risk in which the epistemic risk measure is replaced with expectation. Thus, composite risk provides an estimate that is more sensitive to both aleatory and epistemic sources of uncertainty than the individual and additive formulations. Second, we propose a bootstrapping method, SENTINEL-K, for performing distributional RL. SENTINEL-K uses an ensemble of $K$ learners to estimate the return distribution. We use Follow The Regularised Leader (FTRL) to aggregate the return distributions of the $K$ learners and to estimate the composite risk. We experimentally verify that SENTINEL-K estimates the return distribution better and, when used with the composite risk estimate, demonstrates better risk-sensitive performance than state-of-the-art risk-sensitive and distributional RL algorithms.
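As an illustrative sketch of the nesting described above (the notation here is ours and may differ from the paper's formal definitions), let $\rho_a$ denote an aleatory risk measure applied to the return $Z$ under a fixed model $\theta$, and let $\rho_e$ denote an epistemic risk measure applied over the posterior $p(\theta)$:

\[
\rho_{\mathrm{comp}}[Z] \;=\; \rho_{e}^{\,\theta \sim p(\theta)}\Big[\, \rho_{a}\big[Z \mid \theta\big] \,\Big].
\]

Replacing $\rho_e$ with the expectation operator collapses the composition to $\mathbb{E}_{\theta \sim p(\theta)}\big[\rho_{a}[Z \mid \theta]\big]$, which is why the additive formulation arises as a special case: averaging over $\theta$ is insensitive to the spread of $\rho_a[Z \mid \theta]$ across models, whereas a non-trivial $\rho_e$ penalises that epistemic spread as well.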