We study risk-sensitive reinforcement learning (RL) based on the entropic risk measure. Although existing works have established non-asymptotic regret guarantees for this problem, they leave open an exponential gap between the upper and lower bounds. We identify the deficiencies in existing algorithms and their analysis that result in such a gap. To remedy these deficiencies, we investigate a simple transformation of the risk-sensitive Bellman equations, which we call the exponential Bellman equation. The exponential Bellman equation inspires us to develop a novel analysis of Bellman backup procedures in risk-sensitive RL algorithms, and further motivates the design of a novel exploration mechanism. We show that these analytic and algorithmic innovations together lead to improved regret upper bounds over existing ones.
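For concreteness, the following is a minimal sketch of the transformation, under the standard entropic-risk setup (risk parameter \beta \neq 0, horizon H, reward r_h, transitions P_h; this notation is assumed here rather than defined in the abstract). The risk-sensitive Bellman equations take the form

\[
Q_h(s,a) \;=\; r_h(s,a) \;+\; \frac{1}{\beta}\,\log \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\!\left[ e^{\beta V_{h+1}(s')} \right],
\qquad
V_h(s) \;=\; \max_{a}\, Q_h(s,a),
\]

and exponentiating both sides yields the exponential Bellman equation,

\[
e^{\beta Q_h(s,a)} \;=\; e^{\beta r_h(s,a)} \cdot \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\!\left[ e^{\beta V_{h+1}(s')} \right],
\]

which expresses the backup as a product of exponentiated terms rather than a nonlinear log-expectation of the value at the next step.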