We study regret guarantees for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of the return. We identify a key property of the EntRM, the monotonicity-preserving property, which enables the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, one model-free and one model-based. We prove that both attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|H}H\sqrt{HS^2AT})$ regret upper bound, where $S$ is the number of states, $A$ the number of actions, $H$ the time horizon and $T$ the total number of time steps. This matches the bound of RSVI2 proposed in \cite{fei2021exponential} while admitting a much simpler regret analysis. To the best of our knowledge, this is the first regret analysis of DRL, which bridges DRL and RSRL in terms of sample complexity. Finally, we improve the existing lower bound by proving a tighter bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
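For reference, the standard definition of the entropic risk measure for a random return $X$ with risk parameter $\beta \neq 0$ is sketched below (the notation is illustrative and may differ from the paper's):
\[
\mathrm{EntRM}_\beta(X) \;=\; \frac{1}{\beta}\,\log \mathbb{E}\!\left[e^{\beta X}\right],
\]
which recovers the risk-neutral objective $\mathbb{E}[X]$ in the limit $\beta \to 0$, is risk-seeking for $\beta>0$, and risk-averse for $\beta<0$.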