We propose a novel framework for solving risk-sensitive reinforcement learning (RL) problems in which the agent optimises time-consistent dynamic spectral risk measures. Based on the notion of conditional elicitability, our methodology constructs (strictly consistent) scoring functions that are used as penalisers in the estimation procedure. Our contribution is threefold: we (i) devise an efficient approach to estimating a class of dynamic spectral risk measures with deep neural networks, (ii) prove that these dynamic spectral risk measures can be approximated to arbitrary accuracy by deep neural networks, and (iii) develop a risk-sensitive actor-critic algorithm that uses full episodes and does not require any additional nested transitions. We compare our conceptually improved RL algorithm with the nested simulation approach and illustrate its performance in two settings, statistical arbitrage and portfolio allocation, on both simulated and real data.
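For context, and with notation that is ours rather than fixed by the abstract: a spectral risk measure with spectrum \(\varphi\) admits a quantile representation and, equivalently, a CVaR-mixture representation,
\[
\rho_\varphi(Z) \;=\; \int_0^1 \varphi(u)\, F_Z^{-1}(u)\,\mathrm{d}u
\;=\; \int_0^1 \operatorname{CVaR}_\alpha(Z)\,\mu(\mathrm{d}\alpha),
\]
for a mixing measure \(\mu\) determined by \(\varphi\), and a time-consistent dynamic version is obtained, under the usual recursive construction, by composing one-step conditional risk measures, \(\rho_{0:T}(Z) = \rho_0\bigl(\rho_1\bigl(\cdots\rho_{T-1}(Z)\cdots\bigr)\bigr)\). CVaR is not elicitable on its own but is jointly elicitable with VaR, which is what makes strictly consistent scoring functions of the kind mentioned above available; a Fissler–Ziegel-type score for the pair \((\mathrm{VaR}_\alpha, \mathrm{CVaR}_\alpha)\) of the lower tail takes the form
\[
S(v,c;y) \;=\; \bigl(\mathbb{1}_{\{y\le v\}}-\alpha\bigr)G_1(v) \;-\; \mathbb{1}_{\{y\le v\}}\,G_1(y)
\;+\; G_2(c)\Bigl(c - v + \tfrac{1}{\alpha}\,\mathbb{1}_{\{y\le v\}}(v-y)\Bigr) \;-\; \mathcal{G}_2(c),
\]
with \(G_1\) increasing and \(\mathcal{G}_2\) strictly convex and increasing with \(\mathcal{G}_2' = G_2\); minimising its expectation recovers the true \((\mathrm{VaR}_\alpha, \mathrm{CVaR}_\alpha)\) pair.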
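To make the estimation idea concrete, below is a minimal, hypothetical sketch (ours, not the authors' implementation) of how such a strictly consistent score can serve as a neural-network training loss. The choices \(G_1(z)=z\), \(G_2(z)=\operatorname{sigmoid}(z)\), \(\mathcal{G}_2(z)=\operatorname{softplus}(z)\), the function name fz_score, the tail level alpha, and the toy optimisation are all illustrative assumptions.

```python
# Sketch only (not the authors' code): a Fissler-Ziegel-type strictly
# consistent score for (VaR_alpha, CVaR_alpha) of the lower tail, usable
# as a loss for a network that outputs both statistics.
import torch
import torch.nn.functional as F

def fz_score(v, c, y, alpha):
    """Score S(v, c; y) with G1(z) = z, G2 = sigmoid, curly-G2 = softplus.
    v, c, y: tensors of the same shape; alpha in (0, 1)."""
    ind = (y <= v).float()                 # indicator 1{y <= v}
    return ((ind - alpha) * v - ind * y    # VaR (quantile) part
            + torch.sigmoid(c) * (c - v + ind * (v - y) / alpha)
            - F.softplus(c)).mean()        # minus curly-G2(c)

# Toy usage: recover VaR/CVaR of a standard normal sample by minimising
# the empirical score over two scalar parameters.
torch.manual_seed(0)
y = torch.randn(100_000)
v = torch.zeros(1, requires_grad=True)     # VaR estimate
c = torch.zeros(1, requires_grad=True)     # CVaR estimate
opt = torch.optim.Adam([v, c], lr=0.01)
for _ in range(2_000):
    opt.zero_grad()
    loss = fz_score(v, c, y, alpha=0.1)
    loss.backward()
    opt.step()
# v approaches the 10% quantile (~ -1.28); c approaches the mean of the
# worst 10% of outcomes (~ -1.75).
```

In the actor-critic setting described above, the same score would be evaluated on observed one-step outcomes and added to the critic's objective as the penaliser, with the scalar parameters replaced by network outputs conditioned on the state.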