We address policy learning from logged data in contextual bandits. Current off-policy learning algorithms are mostly based on inverse propensity score (IPS) weighting, which requires the logging policy to have \emph{full support}, i.e., a non-zero probability for every context/action pair reachable under the evaluation policy. However, many real-world systems do not guarantee such logging policies, especially when the action space is large and many actions yield poor or missing rewards. Under such \emph{support deficiency}, offline learning fails to find an optimal policy. We propose a novel approach that combines offline learning with online exploration: online exploration is used to explore actions unsupported in the logged data, while offline learning exploits the supported actions in the logged data, avoiding unnecessary exploration. Our approach determines an optimal policy with theoretical guarantees using a minimal amount of online exploration. We demonstrate the effectiveness of our algorithms empirically on a diverse collection of datasets.
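For reference, a minimal sketch of the standard IPS estimator the abstract refers to, under assumed notation not fixed by the text above (logging policy $\mu$, evaluation policy $\pi$, logged tuples $(x_i, a_i, r_i)$):
\[
\hat{V}_{\mathrm{IPS}}(\pi) \;=\; \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)}\, r_i .
\]
This estimator is unbiased only under the full-support condition $\mu(a \mid x) > 0$ whenever $\pi(a \mid x) > 0$; when support is deficient, the contribution of unsupported actions is silently dropped, which is the failure mode motivating the hybrid offline/online approach.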