Policy learning from historical observational data is an important problem with widespread applications. Examples include selecting offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, the existing literature rests on a crucial assumption: that the future environment where the learned policy will be deployed is the same as the past environment that generated the data. This assumption is often false, or at best too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy from incomplete (bandit) observational data. We propose a novel learning algorithm that produces a policy robust to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure for the ambiguous environment and establish a performance guarantee based on the theory of uniform convergence. We then give a heuristic algorithm that solves the distributionally robust policy learning problem efficiently. Finally, we demonstrate the robustness of our methods on synthetic and real-world datasets.