The optimized certainty equivalent (OCE) is a family of risk measures that covers important examples such as entropic risk, conditional value-at-risk, and mean-variance models. In this paper, we propose a new episodic risk-sensitive reinforcement learning formulation based on tabular Markov decision processes with recursive OCEs. We design an efficient learning algorithm for this problem based on value iteration and upper confidence bounds. We derive an upper bound on the regret of the proposed algorithm and also establish a minimax lower bound. Our bounds show that the regret rate achieved by our proposed algorithm has optimal dependence on the number of episodes and the number of actions.
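For context, a minimal sketch of the OCE in its standard Ben-Tal--Teboulle form (the payoff convention and the normalization of the utility $u$ are standard assumptions, not taken from this abstract):

% OCE of a random payoff X, for a concave, nondecreasing utility u
% with u(0) = 0 and 1 in the subdifferential of u at 0 (standard
% normalization assumed here).
\[
  \mathrm{OCE}_u(X) \;=\; \sup_{\lambda \in \mathbb{R}}
  \Bigl\{ \lambda + \mathbb{E}\bigl[\, u(X - \lambda) \,\bigr] \Bigr\}.
\]
% Standard special cases under this convention:
%   u(t) = (1 - e^{-\gamma t})/\gamma gives the entropic risk measure
%     OCE_u(X) = -(1/\gamma) \log \mathbb{E}[e^{-\gamma X}];
%   u(t) = (1/\alpha) \min(t, 0) recovers a CVaR-type measure via the
%     Rockafellar--Uryasev representation;
%   u(t) = t - c t^2 yields the mean-variance criterion
%     \mathbb{E}[X] - c \operatorname{Var}(X).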