Classical reinforcement learning (RL) techniques are generally concerned with the design of decision-making policies driven by the maximisation of the expected outcome. Nevertheless, this approach does not take into account the potential risk associated with the actions taken, which may be critical in certain applications. To address this issue, the present research work introduces a novel methodology based on distributional RL to derive sequential decision-making policies that are sensitive to risk, the latter being modelled by the tail of the return probability distribution. The core idea is to replace the $Q$ function, which generally stands at the core of learning schemes in RL, with another function taking into account both the expected return and the risk. Named the risk-based utility function $U$, it can be extracted from the random return distribution $Z$ naturally learnt by any distributional RL algorithm. This makes it possible to span the full trade-off between risk minimisation and expected return maximisation, in contrast to fully risk-averse methodologies. Fundamentally, this research yields a truly practical and accessible solution for learning risk-sensitive policies with minimal modification to the distributional RL algorithm, and with an emphasis on the interpretability of the resulting decision-making process.
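As a minimal illustrative sketch of the idea described above, the snippet below shows how a risk-based utility could be computed from quantile samples of the learnt return distribution $Z$, blending the expected return with a tail-risk measure (here CVaR is used as one possible choice of tail measure; the blending weight `rho`, the tail level `alpha`, and the linear combination itself are assumptions for illustration, not the paper's exact definition of $U$):

```python
import numpy as np

def risk_based_utility(z_quantiles, alpha=0.25, rho=0.5):
    """Illustrative utility blending E[Z] with a tail-risk measure of Z.

    z_quantiles: quantile samples of the random return Z learnt by a
    distributional RL algorithm. alpha is the tail level, rho the
    risk-aversion weight (0 = risk-neutral, 1 = fully tail-driven).
    These choices are assumptions made for this sketch.
    """
    z = np.sort(np.asarray(z_quantiles, dtype=float))
    expected_return = z.mean()                  # E[Z]
    k = max(1, int(np.ceil(alpha * len(z))))    # worst alpha-fraction of outcomes
    tail_risk = z[:k].mean()                    # CVaR_alpha(Z): mean of the lower tail
    return (1.0 - rho) * expected_return + rho * tail_risk

def select_action(quantiles_per_action, alpha=0.25, rho=0.5):
    """Greedy action selection on the utility U instead of Q."""
    utilities = [risk_based_utility(q, alpha, rho) for q in quantiles_per_action]
    return int(np.argmax(utilities))
```

Varying `rho` from 0 to 1 spans the trade-off mentioned in the abstract: with `rho=0` the agent maximises the expected return as in standard RL, while with `rho=1` it acts purely on the lower tail of $Z$.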