In the past decade, contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change across environments. In many real-world systems, however, the mechanisms are subject to shifts across environments, which may invalidate the static-environment assumption. In this paper, we tackle the problem of environmental shifts under the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved confounders are present and show that, in that case, an optimal invariant policy is guaranteed to generalize across environments under suitable assumptions. Our results may be a first step towards solving the environmental shift problem. They also establish concrete connections among causality, invariance, and contextual bandits.
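To make the generalization claim concrete, the following is a minimal sketch (not the paper's method) of the multi-environment offline setting: logged bandit data come from two environments whose mechanisms differ, candidate policies are evaluated per environment via inverse propensity scoring, and a policy that is robust across environments is selected by its worst-case value. The environment generator, the two candidate policies, and the min-max selection rule are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_env_logs(shift, n=5000):
    """Hypothetical offline logs from one environment.

    The environment-specific `shift` moves the context distribution and
    the reward of action 0 (a non-invariant mechanism); the reward of
    action 1 depends only on the observed context (an invariant mechanism).
    """
    x = rng.normal(shift, 1.0, n)      # contexts, shifted across environments
    a = rng.integers(0, 2, n)          # actions from a uniform logging policy
    r = np.where(a == 0,
                 0.5 + shift,                  # mechanism changes with the environment
                 0.3 + 0.2 * np.tanh(x))       # mechanism depends only on x
    return x, a, r

envs = [make_env_logs(shift) for shift in (-1.0, +1.0)]

# Two toy deterministic candidate policies mapping context -> action.
policies = {
    "always_0": lambda x: np.zeros_like(x, dtype=int),
    "always_1": lambda x: np.ones_like(x, dtype=int),
}

def ips_value(policy, x, a, r):
    """Off-policy value estimate via inverse propensity scoring.

    The uniform logging policy has propensity 0.5 for every action.
    """
    match = (policy(x) == a).astype(float)
    return float(np.mean(match * r / 0.5))

# Per-environment values, then a worst-case (min over environments) selection:
# the policy relying on the invariant mechanism wins under this criterion,
# even though the non-invariant policy is better in one environment.
values = {name: [ips_value(p, *env) for env in envs]
          for name, p in policies.items()}
robust_choice = max(values, key=lambda name: min(values[name]))
print(robust_choice)  # the invariant policy "always_1"
```

The point of the sketch is only the selection criterion: a policy whose value is stable across environments can dominate in the worst case even when a mechanism-dependent policy is better in some single environment.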