Off-policy learning is a framework for evaluating and optimizing policies without deploying them, using data collected by another policy. Real-world environments are typically non-stationary, so policies learned offline should adapt to these changes. To address this challenge, we study the novel problem of off-policy optimization in piecewise-stationary contextual bandits. Our proposed solution has two phases. In the offline learning phase, we partition logged data into categorical latent states and learn a near-optimal sub-policy for each state. In the online deployment phase, we adaptively switch between the learned sub-policies based on their performance. This approach is practical and analyzable, and we provide guarantees on both the quality of off-policy optimization and the regret during online deployment. To show the effectiveness of our approach, we compare it to state-of-the-art baselines on both synthetic and real-world datasets. Our approach outperforms methods that act only on observed context.
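The two-phase structure described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it assumes the latent-state label of every logged record is already known (the partitioning step is skipped), that contexts are discrete, that sub-policies are tabular and chosen by an inverse propensity scoring (IPS) estimate, and that online switching uses a discounted epsilon-greedy rule as a stand-in for whatever switching rule the paper analyzes. The names `learn_sub_policies` and `SubPolicySwitcher` are hypothetical.

```python
import random
from collections import defaultdict


def learn_sub_policies(logs, n_actions):
    """Offline phase (sketch): one tabular sub-policy per latent state via IPS.

    logs: iterable of (state, context, action, reward, propensity) tuples,
    where propensity is the logging policy's probability of the logged action.
    Assumption: the latent-state label of each record is given.
    Returns {state: {context: best_action}}.
    """
    # totals[state][context][action] accumulates reward / propensity.
    totals = defaultdict(lambda: defaultdict(lambda: [0.0] * n_actions))
    counts = defaultdict(lambda: defaultdict(int))
    for state, context, action, reward, propensity in logs:
        totals[state][context][action] += reward / propensity
        counts[state][context] += 1

    policies = {}
    for state, per_context in totals.items():
        policies[state] = {}
        for context, sums in per_context.items():
            n = counts[state][context]
            # IPS estimate of each action's mean reward in this (state, context).
            ips = [s / n for s in sums]
            policies[state][context] = max(range(n_actions), key=ips.__getitem__)
    return policies


class SubPolicySwitcher:
    """Online phase (sketch): discounted epsilon-greedy choice among sub-policies."""

    def __init__(self, policies, epsilon=0.1, discount=0.9, seed=0):
        self.policies = policies
        self.states = sorted(policies)
        self.epsilon = epsilon
        self.discount = discount
        self.rng = random.Random(seed)
        self.reward_sum = {s: 0.0 for s in self.states}
        self.weight = {s: 0.0 for s in self.states}

    def _score(self, s):
        # Untried sub-policies score +inf so that each is tried at least once.
        if self.weight[s] > 0:
            return self.reward_sum[s] / self.weight[s]
        return float("inf")

    def act(self, context, default_action=0):
        # Explore a random sub-policy with probability epsilon, else exploit.
        if self.rng.random() < self.epsilon:
            chosen = self.rng.choice(self.states)
        else:
            chosen = max(self.states, key=self._score)
        # Contexts never seen in the logs fall back to default_action.
        return chosen, self.policies[chosen].get(context, default_action)

    def update(self, chosen, reward):
        # Discount all statistics so that recent rewards dominate the averages.
        for s in self.states:
            self.reward_sum[s] *= self.discount
            self.weight[s] *= self.discount
        self.reward_sum[chosen] += reward
        self.weight[chosen] += 1.0


# Toy logs: two latent states, one context, uniform logging over two actions.
logs = [
    (state, "x", action, 1.0 if action == good else 0.0, 0.5)
    for state, good in [(0, 0), (1, 1)]
    for action in (0, 1)
    for _ in range(5)
]
policies = learn_sub_policies(logs, n_actions=2)
switcher = SubPolicySwitcher(policies, epsilon=0.1)
chosen, action = switcher.act("x")
switcher.update(chosen, reward=1.0 if action == 1 else 0.0)
```

In this sketch the discount factor makes recent rewards dominate each sub-policy's running average, and `epsilon > 0` keeps occasionally sampling the other sub-policies so that a change in the latent state can be noticed. Both choices are illustrative and carry none of the guarantees stated in the abstract.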