Standard deep reinforcement learning (DRL) aims to maximize expected reward, weighting collected experiences equally when formulating a policy. This differs from human decision-making, where gains and losses are valued differently and outlying outcomes are given increased consideration. It also fails to capitalize on opportunities to improve safety and/or performance through the incorporation of distributional context. Several approaches to distributional DRL have been investigated, with one popular strategy being to evaluate the projected distribution of returns for possible actions. We propose a more direct approach whereby risk-sensitive objectives, specified in terms of the cumulative distribution function (CDF) of the distribution of full-episode rewards, are optimized. This approach allows outcomes to be weighed based on relative quality, can be used for both continuous and discrete action spaces, and may naturally be applied in both constrained and unconstrained settings. We show how to compute an asymptotically consistent estimate of the policy gradient for a broad class of risk-sensitive objectives via sampling, subsequently incorporating variance reduction and regularization measures to facilitate effective on-policy learning. We then demonstrate that the use of moderately "pessimistic" risk profiles, which emphasize scenarios where the agent performs poorly, leads to enhanced exploration and a continual focus on addressing deficiencies. We test the approach using different risk profiles in six OpenAI Safety Gym environments, comparing to state-of-the-art on-policy methods. Without cost constraints, we find that pessimistic risk profiles can be used to reduce cost while improving total reward accumulation. With cost constraints, they are seen to provide higher positive rewards than risk-neutral approaches at the prescribed allowable cost.
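As a rough illustration of the kind of sampled, CDF-based policy-gradient estimate described above, the sketch below weights whole episodes by increments of a distorted empirical CDF of full-episode returns and forms a REINFORCE-style surrogate loss. The distortion function `pessimistic_distortion`, the particular weighting scheme, and the simple baseline used for variance reduction are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np
import torch

def pessimistic_distortion(tau, eta=0.7):
    """Hypothetical concave distortion w(tau) = tau**eta.
    eta < 1 up-weights low-return (poorly performing) episodes;
    the paper's actual risk profiles may take a different form."""
    return tau ** eta

def risk_sensitive_pg_loss(episode_log_probs, episode_returns,
                           distortion=pessimistic_distortion):
    """Sketch of a sampled policy-gradient surrogate for a CDF-based objective.

    episode_log_probs: list of scalar tensors, each the sum of log pi(a_t|s_t)
                       over one episode
    episode_returns:   list of floats, the full-episode rewards R_i
    Each episode receives the increment of the distorted empirical CDF,
    w(i/N) - w((i-1)/N), assigned in order of sorted returns.
    """
    returns = np.asarray(episode_returns, dtype=np.float64)
    n = len(returns)
    order = np.argsort(returns)                 # ascending: worst episodes first
    taus = np.arange(1, n + 1) / n
    weights_sorted = distortion(taus) - distortion(taus - 1.0 / n)
    weights = np.empty(n)
    weights[order] = weights_sorted             # map weights back to episode order

    # Simple variance-reduction step: subtract the distortion-weighted mean return.
    baseline = float(np.dot(weights, returns))
    loss = 0.0
    for logp, R, w_i in zip(episode_log_probs, returns, weights):
        loss = loss - w_i * (R - baseline) * logp
    return loss
```

With the identity distortion w(tau) = tau, the weights reduce to 1/N and the surrogate recovers an ordinary risk-neutral Monte Carlo policy gradient; concave choices emphasize the left tail of the return distribution, matching the "pessimistic" risk profiles discussed in the abstract.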