We introduce a novel anytime Batched Thompson sampling policy for multi-armed bandits where the agent observes the rewards of her actions and adjusts her policy only at the end of a small number of batches. We show that this policy simultaneously achieves a problem-dependent regret of order $O(\log(T))$ and a minimax regret of order $O(\sqrt{T\log(T)})$, while the number of batches can be bounded by $O(\log(T))$ independently of the problem instance over a time horizon $T$. We also show that, in expectation, the number of batches used by our policy admits an instance-dependent bound of order $O(\log\log(T))$. These results indicate that Thompson sampling maintains the same performance in this batched setting as when instantaneous feedback is available after each action, while requiring minimal feedback. They also indicate that Thompson sampling performs competitively with recently proposed algorithms tailored to the batched setting. These algorithms optimize the batch structure for a given time horizon $T$ and prioritize exploration at the beginning of the experiment to eliminate suboptimal actions. We show that Thompson sampling combined with an adaptive batching strategy can achieve a similar performance without knowing the time horizon $T$ of the problem and without having to carefully optimize the batch structure to achieve a target regret bound (i.e., problem-dependent vs. minimax regret) for a given $T$.
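The abstract does not spell out the adaptive batching rule, so the following is only a rough sketch of the general idea: a Beta-Bernoulli Thompson sampling loop in which the posterior is frozen within each batch and the batch length doubles at every batch end, which keeps the number of batches at $O(\log(T))$ without knowledge of $T$. The function name `batched_thompson_sampling`, the doubling schedule, and the Bernoulli reward model are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def batched_thompson_sampling(true_means, horizon, seed=0):
    """Illustrative batched Thompson sampling for Bernoulli arms.

    Within a batch the posterior is frozen: actions are drawn by sampling
    from the Beta posteriors computed at the start of the batch, and the
    observed rewards are folded into the posterior only when the batch ends.
    Batch lengths double geometrically (an assumed schedule), so the number
    of batches grows logarithmically in the horizon without knowing it.
    """
    rng = np.random.default_rng(seed)
    k = len(true_means)
    alpha = np.ones(k)   # Beta posterior successes (uniform prior)
    beta = np.ones(k)    # Beta posterior failures
    t, batch_len, total_reward, n_batches = 0, 1, 0.0, 0

    while t < horizon:
        n_batches += 1
        # Rewards observed during this batch, released only at its end.
        batch_succ = np.zeros(k)
        batch_fail = np.zeros(k)
        for _ in range(min(batch_len, horizon - t)):
            theta = rng.beta(alpha, beta)    # one posterior sample per arm
            arm = int(np.argmax(theta))      # play the arm with the best sample
            reward = float(rng.random() < true_means[arm])
            batch_succ[arm] += reward
            batch_fail[arm] += 1.0 - reward
            total_reward += reward
            t += 1
        # Batch ends: update the posterior and enlarge the next batch.
        alpha += batch_succ
        beta += batch_fail
        batch_len *= 2

    return total_reward, n_batches

# Example: three Bernoulli arms over a horizon of 10,000 rounds.
reward, batches = batched_thompson_sampling([0.3, 0.5, 0.7], horizon=10_000)
print(f"total reward: {reward:.0f}, batches used: {batches}")
```

Because the batch lengths double, the policy is anytime: it needs no upfront choice of $T$, and the number of batches used up to any round $t$ is at most $\lceil\log_2(t)\rceil + 1$.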