While risk-neutral reinforcement learning has shown experimental success in a number of applications, it is well known to be non-robust with respect to noise and perturbations in the parameters of the system. For this reason, risk-sensitive reinforcement learning algorithms have been studied to introduce robustness and sample efficiency, and to lead to better real-life performance. In this work, we introduce new model-free risk-sensitive reinforcement learning algorithms as variants of widely used Policy Gradient algorithms with similar implementation properties. In particular, we study the effect of exponential criteria on the risk sensitivity of the policy of a reinforcement learning agent, and develop variants of the Monte Carlo Policy Gradient algorithm and the online (temporal-difference) Actor-Critic algorithm. Analytical results show that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
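For context, the exponential criterion commonly used in risk-sensitive reinforcement learning replaces the expected cumulative reward with the logarithm of the expected exponentiated return. The display below is an illustrative sketch of this standard objective, not a formula taken from this work; the risk parameter \(\beta\) and per-step rewards \(r_t\) are assumed notation.

\[
J_\beta(\pi) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_{\pi}\!\left[\exp\!\left(\beta \sum_{t=0}^{T} r_t\right)\right].
\]

For small \(|\beta|\), a Taylor expansion gives \(J_\beta(\pi) \approx \mathbb{E}_\pi\!\big[\sum_t r_t\big] + \tfrac{\beta}{2}\,\mathrm{Var}_\pi\!\big[\sum_t r_t\big]\), which illustrates how such criteria penalize (for \(\beta<0\)) or reward (for \(\beta>0\)) the variance of the return, and hence act as a principled counterpart to ad-hoc regularization.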