In many risk-aware and multi-objective reinforcement learning settings, the utility of the user is derived from a single execution of a policy. In these settings, making decisions based on the average future returns is not suitable. For example, in a medical setting a patient may only have one opportunity to treat their illness. When making a decision, the expected return alone -- known in reinforcement learning as the value -- cannot account for the potential range of adverse or positive outcomes a decision may have. Our key insight is that we should use the distribution over expected future returns differently, to represent the critical information that the agent requires at decision time. In this paper, we propose Distributional Monte Carlo Tree Search, an algorithm that learns a posterior distribution over the utility of the different possible returns attainable from individual policy executions, resulting in good policies for both risk-aware and multi-objective settings. Moreover, our algorithm outperforms the state-of-the-art in multi-objective reinforcement learning for the expected utility of the returns.
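To illustrate the key insight, the following minimal sketch (not the paper's algorithm) contrasts deciding on the utility of the expected return, u(E[R]), with deciding on the expected utility of the return distribution, E[u(R)], for a single policy execution. The utility function and the two return distributions are hypothetical, chosen only to show why the expected return alone can hide the range of outcomes.

```python
# Minimal sketch, assuming a hypothetical concave (risk-aware) utility and
# two hypothetical actions with identical expected returns but very
# different one-shot outcomes.
import numpy as np

rng = np.random.default_rng(0)

def utility(r):
    # Hypothetical risk-aware utility: concave, penalises low one-shot returns.
    return np.sqrt(np.maximum(r, 0.0))

# Two hypothetical treatments, both with expected return 50.
returns_a = rng.normal(loc=50.0, scale=5.0, size=100_000)          # low variance
returns_b = rng.choice([0.0, 100.0], p=[0.5, 0.5], size=100_000)   # all-or-nothing

for name, returns in [("A", returns_a), ("B", returns_b)]:
    print(name,
          "u(E[R]) =", round(float(utility(returns.mean())), 3),
          "E[u(R)] =", round(float(utility(returns).mean()), 3))

# Both actions look identical under u(E[R]) (about 7.07), but E[u(R)] clearly
# prefers A (about 7.06 vs 5.0), reflecting the one-shot, risk-aware
# preference the abstract describes.
```

Under a nonlinear utility, E[u(R)] and u(E[R]) generally differ, which is why a distribution over returns, rather than the expected return alone, carries the information needed at decision time.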