Invariant Policy Learning: A Causal Perspective

In the past decade, contextual bandit and reinforcement learning algorithms have been successfully used in various interactive learning systems such as online advertising, recommender systems, and dynamic pricing. However, they have yet to be widely adopted in high-stakes application domains, such as healthcare. One reason may be that existing approaches assume that the underlying mechanisms are static in the sense that they do not change over time or across environments. In many real-world systems, however, the mechanisms are subject to shifts across environments, which may invalidate the static environment assumption. In this paper, we tackle the problem of environmental shifts under the framework of offline contextual bandits. We view the environmental shift problem through the lens of causality and propose multi-environment contextual bandits that allow for changes in the underlying mechanisms. We adopt the concept of invariance from the causality literature and introduce the notion of policy invariance. We argue that policy invariance is only relevant if unobserved confounders are present and show that, in that case, an optimal invariant policy is guaranteed, under certain assumptions, to generalize across environments. Our results not only provide a solution to the environmental shift problem but also establish concrete connections among causality, invariance, and contextual bandits.
