Even when unable to run experiments, practitioners can evaluate prospective policies using previously logged data. However, while the bandits literature has adopted a diverse set of objectives, most research on off-policy evaluation to date focuses on the expected reward. In this paper, we introduce Lipschitz risk functionals, a broad class of objectives that subsumes conditional value-at-risk (CVaR), variance, mean-variance, many distorted risks, and cumulative prospect theory (CPT) risks, among others. We propose Off-Policy Risk Assessment (OPRA), a framework that first estimates a target policy's CDF and then generates plugin estimates for any collection of Lipschitz risks, providing finite sample guarantees that hold simultaneously over the entire class. We instantiate OPRA with both importance sampling and doubly robust estimators. Our primary theoretical contributions are (i) the first uniform concentration inequalities for both CDF estimators in contextual bandits and (ii) error bounds on our Lipschitz risk estimates, which all converge at a rate of $O(1/\sqrt{n})$.
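To make the two-stage recipe concrete, the following is a minimal sketch (not the paper's implementation) of an importance-sampling plugin CDF estimate followed by a plugin CVaR computed from that CDF. The function names, the evaluation grid, and the clipping of the weighted CDF to $[0,1]$ are illustrative assumptions; the paper's own estimators and constants may differ.

```python
import numpy as np

def is_weighted_cdf(rewards, weights, grid):
    """Sketch of an importance-sampling plugin estimate of the target policy's reward CDF.

    rewards : logged rewards r_i under the behavior policy
    weights : importance ratios w_i = pi(a_i | x_i) / mu(a_i | x_i)
    grid    : sorted points y at which to evaluate F_hat(y)
    """
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(weights, dtype=float)
    # One common plugin form: F_hat(y) = (1/n) * sum_i w_i * 1{r_i <= y}, clipped to [0, 1].
    vals = np.array([(w * (r <= y)).mean() for y in grid])
    return np.clip(vals, 0.0, 1.0)

def plugin_cvar(grid, cdf_vals, alpha=0.1):
    """Plugin lower-tail CVaR_alpha from an estimated CDF, using the quantile form
    CVaR_alpha(F) = (1/alpha) * int_0^alpha F^{-1}(u) du, discretised on `grid`."""
    us = np.linspace(0.0, alpha, num=100, endpoint=False) + alpha / 200.0
    # Generalised inverse F^{-1}(u) = min{y : F_hat(y) >= u}; cdf_vals is nondecreasing
    # because `grid` is sorted, so searchsorted gives the right index.
    idx = np.minimum(np.searchsorted(cdf_vals, us, side="left"), len(grid) - 1)
    return float(np.mean(np.asarray(grid)[idx]))

# Hypothetical usage on synthetic logged data.
rng = np.random.default_rng(0)
rewards = rng.normal(loc=1.0, scale=0.5, size=1000)
weights = rng.uniform(0.5, 1.5, size=1000)          # stand-in importance ratios
grid = np.linspace(rewards.min(), rewards.max(), 200)
cdf = is_weighted_cdf(rewards, weights, grid)
print(plugin_cvar(grid, cdf, alpha=0.1))
```

Because any Lipschitz risk functional can be evaluated as a functional of the same estimated CDF, additional risks (variance, mean-variance, distorted risks, CPT risks) would reuse `cdf` directly rather than requiring separate estimators.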