We develop an approach for solving time-consistent risk-sensitive stochastic optimization problems using model-free reinforcement learning (RL). Specifically, we assume agents assess the risk of a sequence of random variables using dynamic convex risk measures. We employ a time-consistent dynamic programming principle to determine the value of a particular policy, and develop policy gradient update rules that aid in obtaining optimal policies. We further develop an actor-critic-style algorithm using neural networks to optimize over policies. Finally, we demonstrate the performance and flexibility of our approach by applying it to three optimization problems: statistical arbitrage trading strategies, financial hedging, and obstacle-avoidance robot control.
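As a concrete illustration of the kind of risk-sensitive policy-gradient step described above, the following sketch optimizes a one-period CVaR objective, one example of a convex risk measure, with a score-function gradient; the full approach composes such measures over time via a dynamic programming recursion. This is a minimal sketch, not the paper's algorithm: the toy cost model, network sizes, tail level, and hyperparameters are illustrative assumptions.

```python
# Sketch: one score-function policy-gradient step for a one-period CVaR
# objective (CVaR is one example of a convex risk measure). All model
# details below are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn

torch.manual_seed(0)
alpha = 0.1                                   # CVaR tail level (assumed)
state_dim, n_actions, n_samples = 4, 3, 256   # assumed dimensions

# Policy network mapping a state to action logits.
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, n_actions))
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

def toy_cost(actions):
    # Illustrative stochastic cost: action-dependent mean plus noise.
    means = torch.tensor([0.0, 0.5, 1.0])
    return means[actions] + 0.3 * torch.randn(actions.shape)

s0 = torch.randn(state_dim)                   # a fixed initial state
logits = actor(s0).expand(n_samples, n_actions)
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
costs = toy_cost(actions)

# Rockafellar-Uryasev representation:
#   CVaR_alpha(C) = min_z  z + E[(C - z)_+] / alpha,
# minimized at z = VaR, the (1 - alpha)-quantile of C. Holding z at that
# optimum, the policy gradient of CVaR is
#   E[ grad log pi(a|s) * (C - z)_+ / alpha ].
z = torch.quantile(costs, 1 - alpha)          # empirical VaR
weights = torch.clamp(costs - z, min=0.0) / alpha

loss = (dist.log_prob(actions) * weights.detach()).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

Repeating this update drives the policy toward actions with a lighter cost tail rather than merely a lower mean cost; replacing the CVaR weights with a different convex risk measure changes only the weighting of the sampled costs.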