We propose a novel framework to solve risk-sensitive reinforcement learning (RL) problems where the agent optimises time-consistent dynamic spectral risk measures. Based on the notion of conditional elicitability, our methodology constructs (strictly consistent) scoring functions that are used as penalizers in the estimation procedure. Our contribution is threefold: we (i) devise an efficient approach to estimate a class of dynamic spectral risk measures with deep neural networks, (ii) prove that these dynamic spectral risk measures can be approximated to arbitrary accuracy using deep neural networks, and (iii) develop a risk-sensitive actor-critic algorithm that uses full episodes and does not require any additional nested transitions. We compare our conceptually improved RL algorithm with the nested simulation approach and illustrate its performance in two settings, statistical arbitrage and portfolio allocation, on both simulated and real data.
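To make the scoring-function idea concrete, the sketch below (not the authors' implementation) trains a neural network to estimate the conditional (VaR, CVaR) pair of a cost distribution; CVaR is the simplest spectral risk measure, and while it is not elicitable on its own, the pair (VaR, CVaR) is jointly elicitable, so a strictly consistent score can serve directly as the training loss. The score used here is the Fissler–Ziegel "FZ0" loss rewritten for losses; the class name `VaRCVaRNet`, the level `alpha`, and the positive-cost assumption are all illustrative choices, not details from the paper.

```python
# Minimal sketch: a strictly consistent scoring function used as the
# training loss ("penalizer") for a conditional (VaR, CVaR) estimator.
# Assumes costs are positive in the tail so that CVaR > 0, which the
# FZ0 score below requires; in practice costs may need an additive shift.

import torch
import torch.nn as nn

class VaRCVaRNet(nn.Module):
    """Maps a state (N, state_dim) to (VaR, CVaR) estimates, each (N, 1)."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.var_head = nn.Linear(hidden, 1)
        self.gap_head = nn.Linear(hidden, 1)

    def forward(self, s: torch.Tensor):
        h = self.body(s)
        v = self.var_head(h)
        # Parameterise CVaR as VaR plus a non-negative gap, so CVaR >= VaR.
        e = v + nn.functional.softplus(self.gap_head(h))
        return v, e

def fz0_score(v, e, y, alpha: float):
    """Fissler-Ziegel FZ0 score for losses y with tail probability alpha,
    i.e. strictly consistent for (VaR, CVaR) at confidence level 1 - alpha."""
    e = e.clamp_min(1e-8)  # numerical guard; the score assumes CVaR > 0
    exceed = (y > v).float()
    return exceed * (y - v) / (alpha * e) + v / e + torch.log(e) - 1.0

def train_step(net, optimizer, states, costs, alpha=0.1):
    """One gradient step: minimising the average score drives (v, e)
    toward the true conditional VaR/CVaR, with no nested simulation."""
    v, e = net(states)
    loss = fz0_score(v, e, costs, alpha).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the score is strictly consistent, its expected value is minimised exactly at the true conditional risk pair, so the estimator can be trained on full episodes of (state, realised cost) samples rather than on nested inner simulations.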