While risk-neutral reinforcement learning has shown experimental success in a number of applications, it is well known to be non-robust with respect to noise and perturbations in the parameters of the system. For this reason, risk-sensitive reinforcement learning algorithms have been studied to introduce robustness and sample efficiency, and to lead to better real-life performance. In this work, we introduce new model-free risk-sensitive reinforcement learning algorithms as variations of widely used Policy Gradient algorithms with similar implementation properties. In particular, we study the effect of exponential criteria on the risk-sensitivity of the policy of a reinforcement learning agent, and develop variants of the Monte Carlo Policy Gradient algorithm and the online (temporal-difference) Actor-Critic algorithm. Analytical results show that the use of exponential criteria generalizes commonly used ad-hoc regularization approaches. The implementation, performance, and robustness properties of the proposed methods are evaluated in simulated experiments.
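As a point of reference, exponential criteria in risk-sensitive reinforcement learning are typically formalized through the entropic (exponential) risk measure; the following is a standard sketch with a risk-sensitivity parameter $\beta$, which is an assumption for illustration rather than the paper's own notation:
\[
J_\beta(\pi) \;=\; \frac{1}{\beta}\,\log \mathbb{E}_\pi\!\left[\exp\!\left(\beta \sum_{t=0}^{T} r_t\right)\right] \;\approx\; \mathbb{E}_\pi\!\left[\sum_{t=0}^{T} r_t\right] + \frac{\beta}{2}\,\mathrm{Var}_\pi\!\left[\sum_{t=0}^{T} r_t\right],
\]
where the approximation holds for small $|\beta|$: negative $\beta$ penalizes the variance of the return (risk-averse behavior), while positive $\beta$ rewards it (risk-seeking behavior). This implicit variance term is what links exponential criteria to regularization-style objectives.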