Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions, such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to the regularization commonly used in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the ground truth. We provide an exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement and, in the ideal case with strong side information, eliminate it completely.
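As a point of reference, below is a minimal sketch of a regularized MIS saddle-point objective of the general kind the abstract alludes to; the quadratic penalty, its weight $\lambda$, and the user-specified distribution $\mu$ are illustrative assumptions, not the paper's exact formulation:
\[
L_\lambda(w, f) \;=\; (1-\gamma)\,\mathbb{E}_{s_0 \sim d_0,\, a_0 \sim \pi}\!\big[f(s_0, a_0)\big]
\;+\; \mathbb{E}_{(s,a,r,s') \sim d^{D}}\!\Big[w(s,a)\big(r + \gamma f(s', \pi) - f(s,a)\big)\Big]
\;-\; \frac{\lambda}{2}\,\mathbb{E}_{(s,a) \sim \mu}\!\big[w(s,a)^2\big],
\]
where $w$ is the candidate density ratio, $f$ the discriminator, $d^D$ the offline data distribution, and $d_0$ the initial-state distribution. Heuristically, the quadratic term makes the objective strongly concave/convex in $w$ under $\mu$, which is what ties the learned function to the ground truth in a $\mu$-weighted norm rather than one dictated by the data distribution.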