Online Multi-Armed Bandits with Adaptive Inference

During online decision making in multi-armed bandits (MAB), one needs to conduct inference on the true mean reward of each arm at each step, based on the data collected so far. However, because the arms are selected adaptively, which yields non-iid data, accurate inference is not straightforward. In particular, sample averaging, which is used in the family of UCB and Thompson sampling (TS) algorithms, is not a good choice, as it suffers from bias and lacks good statistical properties (e.g., asymptotic normality). Our thesis in this paper is that more sophisticated inference schemes that account for the adaptive nature of the sequentially collected data can unlock further performance gains, even though both UCB-type and TS-type algorithms are optimal in the worst case. In particular, we propose a variant of TS-style algorithms, which we call doubly adaptive TS (DATS), that leverages recent advances in causal inference and adaptively reweights the terms of a doubly robust estimator of the true mean reward of each arm. Through 20 synthetic domain experiments and a semi-synthetic experiment based on data from an A/B test of a web service, we demonstrate that using an adaptive inferential scheme (while still retaining the exploration efficacy of TS) provides clear benefits in online decision making: the proposed DATS algorithm has superior empirical performance to existing baselines (UCB and TS) in terms of regret and sample complexity in identifying the best arm. In addition, we provide a finite-time regret bound for doubly adaptive TS that matches (up to log factors) those of UCB and TS algorithms, thereby establishing that its improved practical performance does not come at the cost of worst-case suboptimality.
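
To make the idea of adaptively reweighting a doubly robust estimator concrete, the following Python sketch estimates the mean reward of one arm from adaptively collected data. It is an illustration under stated assumptions, not the DATS algorithm itself: the abstract does not specify the weights or the plug-in model, so the square-root-of-propensity weights, the running-average plug-in, the variance formula, and the names `adaptive_dr_estimate` and `simulate_epsilon_greedy` are all hypothetical choices. The data come from an epsilon-greedy policy rather than TS, because its selection probabilities are known in closed form.

```python
import numpy as np


def adaptive_dr_estimate(actions, rewards, propensities, arm):
    """Adaptively weighted doubly robust estimate of one arm's mean reward.

    actions:      (T,) int array, arm pulled at each step
    rewards:      (T,) float array, reward observed at each step
    propensities: (T, K) float array, probability each arm had of being pulled
    arm:          index of the arm to evaluate
    Returns (estimate, variance_estimate).
    """
    T = len(actions)
    scores = np.empty(T)
    # Assumed weighting scheme: square root of the arm's selection probability.
    weights = np.sqrt(propensities[:, arm])
    reward_sum, pulls = 0.0, 0
    for t in range(T):
        # Plug-in estimate built only from data observed before step t.
        plug_in = reward_sum / pulls if pulls > 0 else 0.0
        pulled = actions[t] == arm
        # Inverse-propensity correction, applied only when the arm was pulled.
        correction = (rewards[t] - plug_in) / propensities[t, arm] if pulled else 0.0
        scores[t] = plug_in + correction
        if pulled:
            reward_sum += rewards[t]
            pulls += 1
    estimate = np.sum(weights * scores) / np.sum(weights)
    variance = np.sum(weights**2 * (scores - estimate) ** 2) / np.sum(weights) ** 2
    return estimate, variance


def simulate_epsilon_greedy(true_means, T, epsilon=0.2, seed=0):
    """Collect adaptive data with epsilon-greedy and unit-variance Gaussian rewards."""
    rng = np.random.default_rng(seed)
    K = len(true_means)
    sums, counts = np.zeros(K), np.zeros(K)
    actions = np.empty(T, dtype=int)
    rewards = np.empty(T)
    propensities = np.empty((T, K))
    for t in range(T):
        # Sample means so far (0 for arms that have not been pulled yet).
        means = np.divide(sums, counts, out=np.zeros(K), where=counts > 0)
        greedy = int(np.argmax(means))
        # Every arm gets epsilon / K; the greedy arm gets the remaining mass.
        probs = np.full(K, epsilon / K)
        probs[greedy] += 1.0 - epsilon
        a = rng.choice(K, p=probs)
        r = rng.normal(true_means[a], 1.0)
        actions[t], rewards[t], propensities[t] = a, r, probs
        sums[a] += r
        counts[a] += 1
    return actions, rewards, propensities


if __name__ == "__main__":
    actions, rewards, propensities = simulate_epsilon_greedy([0.0, 0.5], T=2000)
    for arm in range(2):
        est, var = adaptive_dr_estimate(actions, rewards, propensities, arm)
        print(f"arm {arm}: estimate={est:.3f}, std. error={np.sqrt(var):.3f}")
```

In a TS-style algorithm, an estimate and variance of this kind could replace the sample mean when forming each arm's sampling distribution. The selection probabilities under TS are not available in closed form and would have to be computed or approximated separately, for example by Monte Carlo.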
