Towards Robust Off-Policy Evaluation via Human Inputs

From arXiv: 10 pages, 5 figures, 1 table. Appeared at AIES '22: Proceedings of the 2022 AAAI/ACM Conference on AI, Ethics, and Society. Expanded version of arXiv:2103.15933.

Off-policy evaluation (OPE) methods are crucial tools for evaluating policies in high-stakes domains such as healthcare, where direct deployment is often infeasible, unethical, or expensive. When deployment environments are expected to undergo changes (that is, dataset shifts), it is important for OPE methods to perform robust evaluation of the policies amidst such changes. Existing approaches consider robustness against a large class of shifts that can arbitrarily change any observable property of the environment. This often results in highly pessimistic estimates of the utilities, thereby invalidating policies that might have been useful in deployment. In this work, we address the aforementioned problem by investigating how domain knowledge can help provide more realistic estimates of the utilities of policies. We leverage human inputs on which aspects of the environments may plausibly change, and adapt the OPE methods to only consider shifts on these aspects. Specifically, we propose a novel framework, Robust OPE (ROPE), which considers shifts on a subset of covariates in the data based on user inputs, and estimates worst-case utility under these shifts. We then develop computationally efficient algorithms for OPE that are robust to the aforementioned shifts for contextual bandits and Markov decision processes. We also theoretically analyze the sample complexity of these algorithms. Extensive experimentation with synthetic and real-world datasets from the healthcare domain demonstrates that our approach not only captures realistic dataset shifts accurately, but also results in less pessimistic policy evaluations.
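
To make the idea of "worst-case utility under shifts on a subset of covariates" concrete, here is a minimal sketch for a contextual bandit. It is not the paper's ROPE algorithm. It assumes (all of these are illustrative choices, not taken from the abstract) that per-sample utilities come from a plain importance-sampling estimator, that the user has flagged a single discrete covariate as the one that may shift, that the shift is a change in that covariate's marginal distribution bounded by a KL-divergence budget `delta`, and that everything conditional on the covariate stays fixed. The worst case is computed through the standard KL dual, maximized over a grid of multipliers, which gives a slightly conservative value. The names `ips_values` and `worst_case_utility` are hypothetical.

```python
import numpy as np


def ips_values(rewards, target_probs, logging_probs):
    """Per-sample importance-sampling utilities (assumed estimator)."""
    return rewards * target_probs / logging_probs


def worst_case_utility(values, shift_covariate, delta, lambdas=None):
    """Lower bound on the worst-case mean utility when only the marginal of
    `shift_covariate` may move within a KL ball of radius `delta`.

    Uses the dual form
        inf_{KL(q||p) <= delta} E_q[mu] = sup_{lam > 0} -lam*log E_p[exp(-mu/lam)] - lam*delta
    and approximates the sup by a grid search over `lam`.
    """
    values = np.asarray(values, dtype=float)
    shift_covariate = np.asarray(shift_covariate).ravel()

    # Empirical marginal p and conditional mean utility mu per covariate value.
    _, inverse = np.unique(shift_covariate, return_inverse=True)
    counts = np.bincount(inverse)
    p = counts / counts.sum()
    mu = np.bincount(inverse, weights=values) / counts

    if lambdas is None:
        lambdas = np.logspace(-3, 4, 400)  # assumed search grid

    # Numerically stable log E_p[exp(-mu / lam)] for every lam on the grid.
    z = -mu[None, :] / lambdas[:, None]
    zmax = z.max(axis=1, keepdims=True)
    log_mgf = zmax[:, 0] + np.log((p[None, :] * np.exp(z - zmax)).sum(axis=1))

    dual = -lambdas * log_mgf - lambdas * delta
    return float(dual.max())


if __name__ == "__main__":
    # Synthetic logged data: one 3-valued covariate drives the reward rate.
    rng = np.random.default_rng(0)
    n = 2000
    group = rng.integers(0, 3, size=n)
    rewards = rng.binomial(1, np.array([0.8, 0.5, 0.2])[group]).astype(float)
    values = ips_values(rewards, np.full(n, 0.5), np.full(n, 0.5))

    print("nominal estimate:", values.mean())
    for delta in (0.0, 0.1, 0.5):
        print("delta =", delta, "->", worst_case_utility(values, group, delta))
```

Restricting the adversary to the marginal of one flagged covariate is what keeps the estimate from collapsing toward the worst single sample: the bound can never fall below the lowest per-group mean utility, whereas an unrestricted shift over all observables could be far more pessimistic.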
