We develop an approach for solving time-consistent risk-sensitive stochastic optimization problems using model-free reinforcement learning (RL). Specifically, we assume agents assess the risk of a sequence of random variables using dynamic convex risk measures. We employ a time-consistent dynamic programming principle to determine the value of a particular policy, and develop policy gradient update rules. We further develop an actor-critic style algorithm using neural networks to optimize over policies. Finally, we demonstrate the performance and flexibility of our approach by applying it to optimization problems in statistical arbitrage trading and obstacle avoidance robot control.