Off-policy evaluation and learning (OPE/L) use offline observational data to make better decisions, which is crucial in applications where experimentation is necessarily limited. OPE/L is nonetheless sensitive to discrepancies between the data-generating environment and that where policies are deployed. Recent work proposed distributionally robust OPE/L (DROPE/L) to remedy this, but the proposal relies on inverse-propensity weighting, whose regret rates may deteriorate if propensities are estimated and whose variance is suboptimal even if not. For vanilla OPE/L, this is solved by doubly robust (DR) methods, but they do not naturally extend to the more complex DROPE/L, which involves a worst-case expectation. In this paper, we propose the first DR algorithms for DROPE/L with KL-divergence uncertainty sets. For evaluation, we propose Localized Doubly Robust DROPE (LDR$^2$OPE) and prove its semiparametric efficiency under weak product-rate conditions. Notably, thanks to a localization technique, LDR$^2$OPE only requires fitting a small number of regressions, just like DR methods for vanilla OPE. For learning, we propose Continuum Doubly Robust DROPL (CDR$^2$OPL) and show that, under a product rate condition involving a continuum of regressions, it enjoys a fast regret rate of $\mathcal{O}(N^{-1/2})$ even when unknown propensities are nonparametrically estimated. We further extend our results to general $f$-divergence uncertainty sets. We illustrate the advantage of our algorithms in simulations.
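To make the worst-case expectation concrete, the following is a sketch of the standard duality for KL-constrained distributionally robust values (the notation $P_0$ for the data-generating distribution, $\delta$ for the uncertainty-set radius, $W$ for the policy's reward, and $\alpha$ for the dual variable is assumed here and is not taken from the abstract):
\[
  \inf_{Q:\, \mathrm{KL}(Q \,\|\, P_0) \le \delta} \mathbb{E}_Q[W]
  \;=\;
  \sup_{\alpha \ge 0}\;
  \Bigl\{ -\alpha \log \mathbb{E}_{P_0}\!\bigl[e^{-W/\alpha}\bigr] - \alpha\delta \Bigr\}.
\]
The dual reduces the worst-case expectation to a scalar optimization over $\alpha$, so estimation hinges on the exponentiated-moment functional $\mathbb{E}_{P_0}\!\bigl[e^{-W/\alpha}\bigr]$ at one or more dual values; this is consistent with the localization technique for evaluation and the continuum of regressions for learning described above.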