In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, where the objective is to maximize the tail of the reward-to-go at each step. Different from existing risk-aware RL formulations, Iterated CVaR RL focuses on safety-at-all-time by enabling the agent to tightly control the risk of getting into catastrophic situations at each stage, and is applicable to important risk-sensitive tasks that demand strong safety guarantees throughout the decision process, such as autonomous driving, clinical treatment planning, and robotics. We investigate Iterated CVaR RL under two performance metrics, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms, ICVaR-RM and ICVaR-BPI respectively, and provide matching upper and lower bounds with respect to the number of episodes $K$. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward, and propose an efficient algorithm with constant upper and lower bounds. Finally, the techniques we develop for bounding the change of CVaR due to the value function shift and for decomposing the regret via a distorted visitation distribution are novel and can find applications in other risk-sensitive online learning problems.
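To make the objective concrete, the following sketch restates the two ingredients behind it; the notation ($V_h^\pi$, $r_h$, $p_h$, risk level $\alpha$) is ours and is given only for orientation, since the abstract itself does not display these formulas. The standard lower-tail Conditional Value-at-Risk at level $\alpha \in (0,1]$ of a random variable $X$ is
\[
\mathrm{CVaR}^{\alpha}(X) \;=\; \sup_{z \in \mathbb{R}} \Big\{ z - \tfrac{1}{\alpha}\,\mathbb{E}\big[(z - X)^{+}\big] \Big\},
\]
and an Iterated CVaR value function can then be sketched by a step-wise recursion of the form
\[
V_h^{\pi}(s) \;=\; r_h\big(s, \pi_h(s)\big) \;+\; \mathrm{CVaR}^{\alpha}_{s' \sim p_h(\cdot \mid s,\, \pi_h(s))}\Big( V_{h+1}^{\pi}(s') \Big), \qquad V_{H+1}^{\pi}(\cdot) \equiv 0,
\]
i.e., at every step the agent values the CVaR of the next-step return rather than its expectation. As $\alpha \to 0$, $\mathrm{CVaR}^{\alpha}$ tends to the essential infimum, which recovers the Worst Path RL objective of maximizing the minimum possible cumulative reward.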