Counterfactual risk minimization is a framework for offline policy optimization with logged data consisting of a context, action, propensity score, and reward for each sample point. In this work, we build on this framework and propose a learning method for settings where the rewards of some samples are not observed, so the logged data consists of a subset of samples with unknown rewards and a subset of samples with known rewards. This setting arises in many application domains, including advertising and healthcare. Although reward feedback is missing for some samples, it is still possible to leverage the unknown-reward samples to minimize the risk, and we refer to this setting as semi-counterfactual risk minimization. To approach this learning problem, we derive new upper bounds on the true risk under the inverse propensity score estimator. We then build on these bounds to propose a regularized counterfactual risk minimization method, where the regularization term is based on the logged unknown-reward dataset only and is hence reward-independent. We also propose another algorithm based on generating pseudo-rewards for the logged unknown-reward dataset. Experimental results with neural networks and benchmark datasets indicate that these algorithms can leverage the logged unknown-reward dataset in addition to the logged known-reward dataset.
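To make the setting concrete, the following is a minimal sketch of a regularized objective of the kind described above: an inverse propensity score (IPS) risk estimate over the known-reward samples plus a term computed only from the unknown-reward samples. The function names, the choice of the mean importance weight as the reward-independent penalty, and the trade-off parameter `lam` are illustrative assumptions, not the paper's actual regularizer or bounds.

```python
import numpy as np

def ips_risk(policy_probs, logged_propensities, losses):
    """IPS estimate of the risk of a target policy from logged bandit
    feedback: mean of loss * pi(a|x) / pi0(a|x) over known-reward samples."""
    weights = policy_probs / logged_propensities
    return np.mean(weights * losses)

def reward_independent_regularizer(policy_probs_unknown, propensities_unknown):
    """Hypothetical reward-independent penalty computed only on the
    unknown-reward subset (here, the mean importance weight); it never
    touches any reward signal."""
    return np.mean(policy_probs_unknown / propensities_unknown)

def semi_crm_objective(policy_probs_known, propensities_known, losses_known,
                       policy_probs_unknown, propensities_unknown, lam=0.1):
    """Illustrative semi-counterfactual objective: IPS risk on the
    known-reward samples plus a reward-independent term on the
    unknown-reward samples, weighted by lam."""
    return (ips_risk(policy_probs_known, propensities_known, losses_known)
            + lam * reward_independent_regularizer(policy_probs_unknown,
                                                   propensities_unknown))
```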