Counterfactual reasoning from logged data has become increasingly important for many applications such as web advertising or healthcare. In this paper, we address the problem of learning stochastic policies with continuous actions from the viewpoint of counterfactual risk minimization (CRM). While the CRM framework is appealing and well studied for discrete actions, the continuous-action case raises new challenges regarding modelling, optimization, and~offline model selection with real data, which turns out to be particularly challenging. Our paper contributes to these three aspects of the CRM estimation pipeline. First, we introduce a modelling strategy based on a joint kernel embedding of contexts and actions, which overcomes the shortcomings of previous discretization approaches. Second, we empirically show that the optimization aspect of counterfactual learning is important, and we demonstrate the benefits of proximal point algorithms and differentiable estimators. Finally, we propose an evaluation protocol for offline policies in real-world logged systems, which is challenging since policies cannot be replayed on test data, and we release a new large-scale dataset along with multiple synthetic, yet realistic, evaluation setups.