We study Thompson Sampling algorithms for stochastic multi-armed bandits in the batched setting, where the goal is to minimize regret over a sequence of arm pulls while using only a small number of policy changes (or batches). We propose two algorithms and demonstrate their effectiveness through experiments on both synthetic and real datasets. We also analyze the proposed algorithms theoretically and obtain almost tight regret-batch tradeoffs for the two-arm case.
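To make the batched setting concrete, the following is a minimal sketch of a generic Beta-Bernoulli Thompson Sampling loop in which the posterior is refreshed only at batch boundaries. It is not either of the two proposed algorithms, whose details are not given here. The function names, the Bernoulli reward model, the uniform Beta(1, 1) prior, and the doubling batch grid are all assumptions chosen for illustration.

```python
import random


def doubling_grid(horizon):
    """Hypothetical batch schedule: batches end at rounds 1, 2, 4, ... and at the horizon."""
    ends = []
    t = 1
    while t < horizon:
        ends.append(t)
        t *= 2
    ends.append(horizon)
    return ends


def batched_thompson_sampling(means, horizon, batch_ends, seed=0):
    """Illustrative batched Thompson Sampling for Bernoulli arms (assumed model).

    Within a batch the sampling distribution is frozen; observed rewards are
    buffered and folded into the Beta posteriors only when a batch ends.
    Returns the pull count per arm and the cumulative pseudo-regret.
    """
    rng = random.Random(seed)
    k = len(means)
    alpha = [1] * k          # Beta posterior parameters used by the policy
    beta = [1] * k
    pending_succ = [0] * k   # rewards observed but not yet used by the policy
    pending_fail = [0] * k
    pulls = [0] * k
    regret = 0.0
    best = max(means)
    boundaries = set(batch_ends)

    for t in range(1, horizon + 1):
        # Sample a mean for each arm from the frozen posterior and play the argmax.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < means[arm] else 0

        pending_succ[arm] += reward
        pending_fail[arm] += 1 - reward
        pulls[arm] += 1
        regret += best - means[arm]

        # The policy changes only at the end of a batch.
        if t in boundaries:
            for i in range(k):
                alpha[i] += pending_succ[i]
                beta[i] += pending_fail[i]
                pending_succ[i] = 0
                pending_fail[i] = 0

    return pulls, regret


if __name__ == "__main__":
    horizon = 1000
    grid = doubling_grid(horizon)
    pulls, regret = batched_thompson_sampling([0.5, 0.6], horizon, grid)
    print("batches:", len(grid), "pulls:", pulls, "pseudo-regret:", round(regret, 2))
```

With a horizon of 1000, the doubling grid uses 11 batches instead of 1000 posterior updates. Varying the grid in this sketch is one way to explore the tension between the number of batches and the regret.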