We present a mean-variance policy iteration (MVPI) framework for risk-averse control in a discounted infinite-horizon MDP, where risk is measured by the variance of a per-step reward random variable. MVPI enjoys great flexibility in that any policy evaluation method and any risk-neutral control method can be dropped in off the shelf for risk-averse control, in both on- and off-policy settings. This flexibility narrows the gap between risk-neutral and risk-averse control and is achieved by working directly on a novel augmented MDP. As an example instantiation of MVPI, we propose risk-averse TD3, which outperforms vanilla TD3 and many previous risk-averse control methods on challenging Mujoco robot simulation tasks under a risk-aware performance metric. This risk-averse TD3 is the first method to bring deterministic policies and off-policy learning into risk-averse reinforcement learning, both of which are key to the performance gains we observe in the Mujoco domains.
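For concreteness, the following is a minimal sketch of a per-step mean-variance objective of the kind targeted here; the notation ($d_\pi$, $r(s,a)$, and the trade-off coefficient $\lambda$) is illustrative and is not taken verbatim from the paper:
$$
J_\lambda(\pi) \;=\; \mathbb{E}_{s \sim d_\pi,\, a \sim \pi(\cdot \mid s)}\bigl[r(s,a)\bigr] \;-\; \lambda \,\mathrm{Var}_{s \sim d_\pi,\, a \sim \pi(\cdot \mid s)}\bigl[r(s,a)\bigr],
$$
where $d_\pi$ denotes the state distribution induced by $\pi$ and $\lambda \ge 0$ controls the degree of risk aversion. Under this view, risk-averse control can be cast as risk-neutral control on an MDP with a suitably augmented reward, which is the source of the plug-and-play flexibility described above.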