The classic objective in a reinforcement learning (RL) problem is to find a policy that minimizes, in expectation, a long-run objective such as the infinite-horizon cumulative discounted cost or the long-run average cost. In many practical applications, optimizing the expected value alone is not sufficient, and it may be necessary to include a risk measure in the optimization process, either in the objective or as a constraint. Various risk measures have been proposed in the literature, e.g., variance, exponential utility, percentile performance, chance constraints, value-at-risk (quantile), conditional value-at-risk, coherent risk measures, and prospect theory and its later enhancement, cumulative prospect theory. In this article, we focus on the combination of risk criteria and reinforcement learning in a constrained optimization framework, i.e., a setting where the goal is to find a policy that optimizes the usual objective of infinite-horizon discounted/average cost, while ensuring that an explicit risk constraint is satisfied. We introduce the risk-constrained RL framework, cover popular risk measures based on variance, conditional value-at-risk, and chance constraints, and present a template for a risk-sensitive RL algorithm. Next, we study risk-sensitive RL with the objective of minimizing risk in an unconstrained framework, and cover cumulative prospect theory and coherent risk measures as special cases. We survey some of the recent work on this topic, covering discounted cost, average cost, and stochastic shortest path settings, together with the aforementioned risk measures, in constrained as well as unconstrained frameworks. This non-exhaustive survey aims to give a flavor of the challenges involved in solving risk-sensitive RL problems and to outline some potential future research directions.
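To make the notion of an explicit risk constraint concrete, the following minimal sketch estimates value-at-risk and conditional value-at-risk from a sample of cumulative costs and checks them against a threshold. It is an illustration under stated assumptions, not the algorithm template from the article: the function names `empirical_var_cvar` and `satisfies_cvar_constraint`, the confidence level `alpha`, and the `threshold` parameter are all hypothetical, and the costs are assumed to be i.i.d. samples collected by running a fixed policy.

```python
import numpy as np


def empirical_var_cvar(costs, alpha=0.95):
    """Return empirical (VaR, CVaR) of sampled costs at confidence level alpha.

    Hypothetical helper for illustration; assumes 0 < alpha < 1 and that
    larger costs are worse.
    """
    sorted_costs = np.sort(np.asarray(costs, dtype=float))
    n = sorted_costs.size
    # VaR: smallest sample whose empirical CDF value is at least alpha.
    idx = max(int(np.ceil(alpha * n)) - 1, 0)
    var = sorted_costs[idx]
    # CVaR via the Rockafellar-Uryasev form: VaR + E[(X - VaR)^+] / (1 - alpha).
    excess = np.maximum(sorted_costs - var, 0.0)
    cvar = var + excess.mean() / (1.0 - alpha)
    return var, cvar


def satisfies_cvar_constraint(costs, threshold, alpha=0.95):
    """Check the (assumed) risk constraint CVaR_alpha(cost) <= threshold."""
    _, cvar = empirical_var_cvar(costs, alpha)
    return bool(cvar <= threshold)


if __name__ == "__main__":
    sample_costs = np.arange(1, 11)  # toy costs 1, 2, ..., 10
    var, cvar = empirical_var_cvar(sample_costs, alpha=0.5)
    print(var, cvar)
    print(satisfies_cvar_constraint(sample_costs, threshold=9.0, alpha=0.5))
```

In a risk-constrained setting, a check of this kind would sit alongside the usual expected-cost objective: a policy is acceptable only if its estimated risk stays below the threshold, and among acceptable policies the one with the lowest expected cost is preferred.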