Standard deep reinforcement learning (DRL) aims to maximize expected reward, weighting collected experiences equally when formulating a policy. This differs from human decision-making, where gains and losses are valued differently and outlying outcomes are given increased consideration. It also fails to capitalize on opportunities to improve safety and/or performance through the incorporation of distributional context. Several approaches to distributional DRL have been investigated, with one popular strategy being to evaluate the projected distribution of returns for possible actions. We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized. This approach allows outcomes to be weighed based on relative quality, can be used for both continuous and discrete action spaces, and may naturally be applied in both constrained and unconstrained settings. We show how to compute an asymptotically consistent estimate of the policy gradient for a broad class of risk-sensitive objectives via sampling, subsequently incorporating variance reduction and regularization measures to facilitate effective on-policy learning. We then demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies. We test the approach using different risk profiles in six OpenAI Safety Gym environments, comparing to state-of-the-art on-policy methods. Without cost constraints, we find that pessimistic risk profiles can be used to reduce cost while improving total reward accumulation. With cost constraints, they are seen to provide higher positive rewards than risk-neutral approaches at the prescribed allowable cost.
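As a rough illustration of the kind of sampled, CDF-based policy-gradient estimate described above, the sketch below weights whole episodes by increments of a distorted empirical CDF of full-episode returns and forms a REINFORCE-style surrogate loss. The distortion function `pessimistic_distortion`, the particular weighting scheme, and the simple baseline used for variance reduction are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np
import torch

def pessimistic_distortion(tau, eta=0.7):
    """Hypothetical concave distortion w(tau) = tau**eta.
    eta < 1 up-weights low-return (poorly performing) episodes;
    the paper's actual risk profiles may take a different form."""
    return tau ** eta

def risk_sensitive_pg_loss(episode_log_probs, episode_returns,
                           distortion=pessimistic_distortion):
    """Sketch of a sampled policy-gradient surrogate for a CDF-based objective.

    episode_log_probs: list of scalar tensors, each the sum of log pi(a_t|s_t)
                       over one episode
    episode_returns:   list of floats, the full-episode rewards R_i
    Each episode receives the increment of the distorted empirical CDF,
    w(i/N) - w((i-1)/N), assigned in order of sorted returns.
    """
    returns = np.asarray(episode_returns, dtype=np.float64)
    n = len(returns)
    order = np.argsort(returns)                 # ascending: worst episodes first
    taus = np.arange(1, n + 1) / n
    weights_sorted = distortion(taus) - distortion(taus - 1.0 / n)
    weights = np.empty(n)
    weights[order] = weights_sorted             # map weights back to episode order

    # Simple variance-reduction step: subtract the distortion-weighted mean return.
    baseline = float(np.dot(weights, returns))
    loss = 0.0
    for logp, R, w_i in zip(episode_log_probs, returns, weights):
        loss = loss - w_i * (R - baseline) * logp
    return loss
```

With the identity distortion w(tau) = tau, the weights reduce to 1/N and the surrogate recovers an ordinary risk-neutral Monte Carlo policy gradient; concave choices emphasize the left tail of the return distribution, matching the "pessimistic" risk profiles discussed in the abstract.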