To evaluate prospective contextual bandit policies when experimentation is not possible, practitioners often rely on off-policy evaluation, using data collected under a behavioral policy. While off-policy evaluation studies typically focus on the expected return, practitioners often care about other functionals of the reward distribution (e.g., to express aversion to risk). In this paper, we first introduce the class of Lipschitz risk functionals, which subsumes many common functionals, including variance, mean-variance, and conditional value-at-risk (CVaR). For Lipschitz risk functionals, the error in off-policy risk estimation is bounded by the error in off-policy estimation of the cumulative distribution function (CDF) of rewards. Second, we propose Off-Policy Risk Assessment (OPRA), an algorithm that (i) estimates the target policy's CDF of rewards; and (ii) generates a plug-in estimate of the risk. Given a collection of Lipschitz risk functionals, OPRA provides estimates for each with corresponding error bounds that hold simultaneously. We analyze both importance sampling and variance-reduced doubly robust estimators of the CDF. Our primary theoretical contributions are (i) the first concentration inequalities for both types of CDF estimators and (ii) guarantees on our Lipschitz risk functional estimates, which converge at a rate of $O(1/\sqrt{n})$. For practitioners, OPRA offers a practical solution for providing high-confidence assessments of policies using a collection of relevant metrics.
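Informally, the central property can be sketched as follows (the notation here is ours, and the specific choice of the sup norm is an illustrative assumption; the paper's precise statement may use other norms on CDFs). A risk functional $\rho$ mapping reward CDFs to the reals is $L$-Lipschitz if

\[ |\rho(F) - \rho(G)| \le L \, \|F - G\|_\infty \quad \text{for all CDFs } F, G. \]

Applied to a plug-in estimate built from an estimated CDF $\hat{F}_\pi$ of the target policy $\pi$, this gives

\[ |\rho(\hat{F}_\pi) - \rho(F_\pi)| \le L \, \|\hat{F}_\pi - F_\pi\|_\infty, \]

so any concentration inequality for the CDF estimator transfers, up to the factor $L$, simultaneously to every risk functional in the class.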
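The Python sketch below illustrates the two steps of OPRA as described above: an importance-sampling estimate of the target policy's reward CDF on a grid of thresholds, followed by a plug-in estimate of lower-tail CVaR. All function names and the grid-based quantile and integration scheme are illustrative assumptions, not the paper's implementation; the doubly robust estimator and the simultaneous confidence bands are omitted.

import numpy as np

def is_cdf_estimate(rewards, behavior_probs, target_probs, grid):
    """Importance-sampling estimate of the target policy's reward CDF.

    rewards:        logged rewards r_i collected under the behavioral policy
    behavior_probs: beta(a_i | x_i), propensity of each logged action
    target_probs:   pi(a_i | x_i), target policy's probability of that action
    grid:           sorted reward thresholds at which to evaluate the CDF
    """
    w = target_probs / behavior_probs  # importance weights
    # F_hat(t) = (1/n) * sum_i w_i * 1{r_i <= t}; nondecreasing in t
    return np.array([np.mean(w * (rewards <= t)) for t in grid])

def plugin_cvar(grid, cdf, alpha=0.95):
    """Plug-in lower-tail CVaR_alpha of rewards from an estimated CDF.

    Uses CVaR_alpha = VaR - E[(VaR - R)_+] / (1 - alpha), where VaR is the
    (1 - alpha)-quantile and E[(VaR - R)_+] = integral of F(t) dt for t <= VaR
    (integration by parts). A rough numerical sketch that assumes the grid
    covers the reward support from below; not the paper's code.
    """
    cdf = np.clip(cdf, 0.0, 1.0)               # IS estimates can exit [0, 1]
    var_idx = np.searchsorted(cdf, 1 - alpha)  # first index with F(t) >= 1 - alpha
    var = grid[min(var_idx, len(grid) - 1)]
    mask = grid <= var
    tail = np.trapz(cdf[mask], grid[mask]) if mask.sum() > 1 else 0.0
    return var - tail / (1 - alpha)

In use, one takes logged tuples $(x_i, a_i, r_i)$ with propensities $\beta(a_i \mid x_i)$, evaluates target_probs under the candidate policy $\pi$, and calls the two functions above; repeating the plug-in step for each functional in a collection of Lipschitz risk functionals yields the simultaneous assessments the abstract describes.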