Risk management is critical in decision making, and the mean-variance (MV) trade-off is one of the most common criteria. However, in reinforcement learning (RL) for sequential decision making under uncertainty, most existing methods for MV control suffer from computational difficulties caused by the double sampling problem. In this paper, in contrast to strict MV control, we consider learning MV efficient policies, i.e., policies that are Pareto efficient with respect to the MV trade-off. To this end, we train an agent to maximize the expected quadratic utility function, a common objective of risk management in finance and economics. We call our approach direct expected quadratic utility maximization (EQUM). EQUM does not suffer from the double sampling issue because its objective does not involve gradient estimation of the variance. We confirm that, under a certain condition, the maximizer of the EQUM objective directly corresponds to an MV efficient policy. We conduct experiments in benchmark settings to demonstrate the effectiveness of EQUM.
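The key identity behind the approach is that for the quadratic utility u(x) = x - (lambda/2) x^2, the expectation E[u(R)] = E[R] - (lambda/2)(Var(R) + E[R]^2) penalizes variance, yet the sample mean of u(R_i) is an unbiased single-sample estimator, whereas estimating the gradient of Var(R) requires E[R]^2 and hence two independent samples (the double sampling problem). A minimal numerical sketch, with a hypothetical risk-aversion weight `lam` and two synthetic return distributions (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def quadratic_utility(returns, lam=0.5):
    # u(x) = x - (lam/2) x^2; lam is a hypothetical risk-aversion weight
    return returns - 0.5 * lam * returns ** 2

# Two synthetic return distributions with equal mean but different variance
low_var = rng.normal(loc=0.1, scale=0.05, size=100_000)
high_var = rng.normal(loc=0.1, scale=0.50, size=100_000)

# A plain sample average of u(R_i) estimates E[u(R)] without bias,
# so no double sampling is needed
eu_low = quadratic_utility(low_var).mean()
eu_high = quadratic_utility(high_var).mean()

# The lower-variance distribution attains higher expected quadratic utility,
# reflecting the mean-variance trade-off encoded in the objective
print(eu_low > eu_high)
```

Under these assumptions the low-variance returns score higher, illustrating why maximizing expected quadratic utility steers the agent toward MV efficient behavior.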