Unlike traditional supervised learning, many settings provide only partial feedback: we observe outcomes for the chosen actions, but not the counterfactual outcomes associated with the alternatives. Such settings encompass a wide variety of applications, including pricing, online marketing, and precision medicine. A key challenge is that observational data are influenced by the historical policies deployed in the system, yielding a biased data distribution. We approach this task as a domain adaptation problem and propose a self-training algorithm that imputes categorical outcomes for the finite set of unseen actions in the observational data, simulating a randomized trial through pseudolabeling; we refer to this method as Counterfactual Self-Training (CST). CST iteratively imputes pseudolabels and retrains the model. In addition, we show that an input consistency loss can further improve CST performance, a finding supported by recent theoretical analyses of pseudolabeling. We demonstrate the effectiveness of the proposed algorithms on both synthetic and real datasets.
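To make the impute-then-retrain structure concrete, below is a minimal sketch of the CST loop under illustrative assumptions: a scikit-learn classifier as the outcome model, a toy logged-bandit dataset, and hard pseudolabels for the unseen actions. None of these choices (model class, number of iterations, feature encoding) are prescribed by the method description above; they are stand-ins to show the iteration.

```python
# Hypothetical sketch of the Counterfactual Self-Training (CST) loop.
# Dataset, model, and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy logged bandit data: context X, chosen action a, observed outcome y.
n, d, n_actions = 500, 5, 3
X = rng.normal(size=(n, d))
logged_a = rng.integers(0, n_actions, size=n)
logged_y = rng.integers(0, 2, size=n)  # categorical (here binary) outcomes

def features(X, a, n_actions):
    """Concatenate the context with a one-hot encoding of the action."""
    return np.hstack([X, np.eye(n_actions)[a]])

# Initialize the outcome model on the factual (observed) data only.
model = LogisticRegression(max_iter=1000)
model.fit(features(X, logged_a, n_actions), logged_y)

for _ in range(10):  # CST iterations
    # Imputation step: pseudolabel every unseen (counterfactual) action,
    # simulating a randomized trial over the finite action set.
    Xs, ys = [features(X, logged_a, n_actions)], [logged_y]
    for a in range(n_actions):
        mask = logged_a != a  # examples where action a was not taken
        Fa = features(X[mask], np.full(mask.sum(), a), n_actions)
        Xs.append(Fa)
        ys.append(model.predict(Fa))  # hard pseudolabels
    # Retraining step: fit on factual outcomes plus imputed counterfactuals.
    model.fit(np.vstack(Xs), np.concatenate(ys))

# The input consistency loss mentioned above would add a term penalizing
# prediction changes under small input perturbations; omitted here for brevity.
```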