Title: A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations

Siyu Chen, Yitan Wang, Zhaoran Wang, Zhuoran Yang

From arXiv, 76 pages, 5 figures

Abstract: We study the offline contextual bandit problem, where we aim to acquire an optimal policy using observational data. However, such data usually has two deficiencies: (i) some variables that confound actions are not observed, and (ii) missing observations exist in the collected data. Unobserved confounders lead to confounding bias, and missing observations cause bias and inefficiency. To overcome these challenges and learn the optimal policy from the observed dataset, we present a new algorithm called Causal-Adjusted Pessimistic (CAP) policy learning, which forms the reward function as the solution of an integral equation system, builds a confidence set, and greedily takes actions with pessimism. Under mild assumptions on the data, we develop an upper bound on the suboptimality of CAP for the offline contextual bandit problem.
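
The "greedy with pessimism" step can be illustrated with a minimal sketch. This is not the CAP algorithm itself: it skips the integral-equation step entirely and assumes that, for a single context, a point estimate of the reward and a confidence width are already available for each action. The names `pessimistic_action`, `reward_hat`, and `width` are hypothetical and chosen only for this illustration.

```python
def pessimistic_action(reward_hat, width):
    """Pick the action with the largest lower confidence bound.

    reward_hat: estimated reward of each action for one context (assumed given).
    width: half-width of the confidence interval of each action (assumed given).
    """
    if len(reward_hat) != len(width):
        raise ValueError("reward_hat and width must have the same length")
    # Worst-case reward of each action within its confidence interval.
    lower_bounds = [r - w for r, w in zip(reward_hat, width)]
    # Greedy choice with respect to the pessimistic (lower-bound) values.
    return max(range(len(lower_bounds)), key=lower_bounds.__getitem__)


# Illustrative numbers only: action 1 has the highest point estimate but a
# wide confidence interval, as if it were poorly covered by the offline data.
reward_hat = [0.6, 0.9, 0.3]
width = [0.1, 0.7, 0.05]

greedy = max(range(len(reward_hat)), key=reward_hat.__getitem__)
pessimistic = pessimistic_action(reward_hat, width)
```

In this toy example the plain greedy rule selects action 1, while the pessimistic rule selects action 0, whose lower bound is the largest. Penalizing uncertainty in this way is the usual motivation for pessimism in offline settings, where actions that are poorly represented in the data cannot be explored further.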
