Thompson sampling (TS) has emerged as a robust technique for contextual bandit problems. However, TS requires posterior inference and optimization for action generation, prohibiting its use in many internet applications where latency and ease of deployment are of concern. We propose a novel imitation-learning-based algorithm that distills a TS policy into an explicit policy representation by performing posterior inference and optimization offline. The explicit policy representation enables fast online decision-making and easy deployment in mobile and server-based environments. Our algorithm iteratively performs offline batch updates to the TS policy and learns a new imitation policy. Since we update the TS policy with observations collected under the imitation policy, our algorithm emulates an off-policy version of TS. Our imitation algorithm guarantees Bayes regret comparable to TS, up to the sum of single-step imitation errors. We show these imitation errors can be made arbitrarily small when unlabeled contexts are cheaply available, which is the case for most large-scale internet applications. Empirically, we show that our imitation policy achieves comparable regret to TS, while reducing decision-time latency by over an order of magnitude. Our algorithm is deployed in video upload systems at Facebook and Instagram and is handling millions of uploads each day.
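To make the iterative distillation loop described above concrete, the following is a minimal sketch assuming a linear-Gaussian reward model per arm and a softmax imitation policy. The class and function names (`LinearTSPosterior`, `fit_imitation_policy`, `imitation_act`) and all hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

class LinearTSPosterior:
    """Per-arm Bayesian linear regression posterior N(mu_a, Sigma_a)."""

    def __init__(self, n_arms, dim, prior_var=1.0, noise_var=1.0):
        self.noise_var = noise_var
        self.precisions = [np.eye(dim) / prior_var for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]

    def update(self, contexts, arms, rewards):
        # Offline batch update with logged (context, arm, reward) triples
        # collected under the current imitation policy.
        for x, a, r in zip(contexts, arms, rewards):
            self.precisions[a] += np.outer(x, x) / self.noise_var
            self.b[a] += r * x / self.noise_var

    def sample_best_arm(self, x, rng):
        # One TS draw: sample parameters for each arm, pick the argmax.
        scores = []
        for prec, b in zip(self.precisions, self.b):
            cov = np.linalg.inv(prec)
            theta = rng.multivariate_normal(cov @ b, cov)
            scores.append(x @ theta)
        return int(np.argmax(scores))


def fit_imitation_policy(posterior, unlabeled_contexts, n_arms,
                         n_ts_samples=32, epochs=200, lr=0.1, seed=0):
    """Fit a softmax policy W to imitate TS action frequencies on
    cheaply available unlabeled contexts (all done offline)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(unlabeled_contexts)

    # Monte Carlo estimate of TS action probabilities per context.
    targets = np.zeros((len(X), n_arms))
    for i, x in enumerate(X):
        for _ in range(n_ts_samples):
            targets[i, posterior.sample_best_arm(x, rng)] += 1.0
    targets /= n_ts_samples

    # Minimize cross-entropy between the softmax policy and TS targets.
    W = np.zeros((X.shape[1], n_arms))
    for _ in range(epochs):
        logits = X @ W
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        W -= lr * X.T @ (probs - targets) / len(X)
    return W


def imitation_act(W, x):
    """Fast online decision: a single matrix-vector product, no
    posterior inference or optimization at decision time."""
    return int(np.argmax(x @ W))
```

Under these assumptions, each offline round would pass observations logged under the imitation policy to `update`, then re-run `fit_imitation_policy` on a fresh pool of unlabeled contexts; only `imitation_act` runs at decision time, which is the source of the latency reduction claimed in the abstract.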