We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. Although these methods hold the promise of overcoming the exponential variance of traditional importance sampling, several key problems remain: (1) They require function approximation and are generally biased. For the sake of trustworthy OPE, is there any way to quantify the biases? (2) They are split into two styles ("weight-learning" vs. "value-learning"). Can we unify them? In this paper we answer both questions positively. By slightly altering the derivations of previous methods (one from each style; Uehara et al., 2020), we unify them into a single value interval that comes with a special type of double robustness: when either the value-function class or the importance-weight class is well specified, the interval is valid, and its length quantifies the misspecification of the other class. Our interval also provides a unified view of and new insights into several recent methods, and we further explore the implications of our results for exploration and exploitation in off-policy policy optimization with insufficient data coverage.
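For readers unfamiliar with these minimax objectives, the following is a hedged sketch of how such a value interval can arise from the standard marginalized-importance-sampling Lagrangian (as in Uehara et al., 2020); the symbols $L$, $\mathcal{Q}$, $\mathcal{W}$ and the particular one-sided derivation below are illustrative, not the paper's exact construction. Writing
\[
  L(w, q) \;=\; (1-\gamma)\,\mathbb{E}_{s_0 \sim d_0}\!\big[q(s_0,\pi)\big]
  \;+\; \mathbb{E}_{(s,a,r,s') \sim \mu}\!\Big[w(s,a)\,\big(r + \gamma\, q(s',\pi) - q(s,a)\big)\Big],
\]
where $q(s,\pi) := \mathbb{E}_{a \sim \pi(\cdot\mid s)}[q(s,a)]$, $d_0$ is the initial-state distribution, and $\mu$ is the data distribution, we have $L(w, q^\pi) = J(\pi)$ for every $w$ and $L(w^\pi, q) = J(\pi)$ for every $q$, with $J(\pi)$ the (normalized) value of $\pi$. If, for instance, the weight class $\mathcal{W}$ contains the true marginalized importance weights $w^\pi$, then for every $q$,
\[
  \inf_{w \in \mathcal{W}} L(w,q) \;\le\; J(\pi) \;\le\; \sup_{w \in \mathcal{W}} L(w,q),
\]
so the interval
\[
  \Big[\, \sup_{q \in \mathcal{Q}} \inf_{w \in \mathcal{W}} L(w,q), \;\;
  \inf_{q \in \mathcal{Q}} \sup_{w \in \mathcal{W}} L(w,q) \,\Big]
\]
contains $J(\pi)$. A symmetric argument with the roles of $\mathcal{Q}$ and $\mathcal{W}$ swapped yields a valid interval when the value-function class is well specified; the contribution summarized above is a single interval enjoying both guarantees simultaneously.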