We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy must be deterministic due to domain requirements, such as prescribing treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for deterministic target policies with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric so as to minimize the overall mean squared error (MSE). Based on an analysis of the bias and variance, we present an analytic solution for the optimal metric. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work goes a step further by handling vector action spaces and optimizing a full kernel metric. We show that our estimator is consistent and, through experiments on various domains, that it significantly reduces the MSE compared to baseline OPE methods.
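To make the relaxation concrete, the following is a minimal sketch of a kernel-relaxed importance-sampling estimator: the deterministic target action is smoothed by a Gaussian kernel with a positive-definite metric matrix, and the kernel weight replaces the (ill-posed) density ratio of a point-mass policy. The function names, array shapes, and the choice of a fixed Gaussian kernel here are illustrative assumptions, not the paper's exact estimator or metric-learning procedure.

```python
import numpy as np

def gaussian_kernel(diff, metric):
    """Multivariate Gaussian kernel with a metric matrix (illustrative choice).

    diff:   (n, d) action differences a_i - pi(x_i)
    metric: (d, d) positive-definite kernel metric Lambda
    """
    d = diff.shape[1]
    inv = np.linalg.inv(metric)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)  # Mahalanobis-type quadratic form
    norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(metric))
    return norm * np.exp(-0.5 * quad)

def kernel_relaxed_ope(actions, rewards, behavior_density, target_actions, metric):
    """Kernel-relaxed IS estimate of a deterministic target policy's value.

    actions:          (n, d) logged actions a_i
    rewards:          (n,)   logged rewards r_i
    behavior_density: (n,)   behavior densities pi_b(a_i | x_i)
    target_actions:   (n, d) deterministic target actions pi(x_i)
    metric:           (d, d) kernel metric (fixed here; learned in the paper
                             by minimizing the MSE)
    """
    weights = gaussian_kernel(actions - target_actions, metric) / behavior_density
    return np.mean(weights * rewards)
```

In this sketch the metric is passed in as a fixed matrix; the paper's contribution is choosing it analytically to trade off the bias introduced by the kernel relaxation against the variance of the importance weights.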