We propose algorithms based on a multi-level Thompson sampling scheme for the stochastic multi-armed bandit and its contextual variant with linear expected rewards, in the setting where arms are clustered. We show, both theoretically and empirically, how exploiting a given cluster structure can significantly improve regret and computational cost compared to standard Thompson sampling. In the case of the stochastic multi-armed bandit, we give upper bounds on the expected cumulative regret, showing how it depends on the quality of the clustering. Finally, we perform an empirical evaluation showing that our algorithms compare favorably to previously proposed algorithms for bandits with clustered arms.
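The abstract does not spell out the algorithm itself, but the following is a minimal sketch of one plausible two-level Thompson sampling scheme for Bernoulli rewards with clustered arms: sample a cluster from cluster-level posteriors, then sample an arm within that cluster from arm-level posteriors. The class name, the use of Beta posteriors at both levels, and the toy problem are illustrative assumptions, not necessarily the paper's exact method.

```python
import numpy as np


class TwoLevelThompsonSampling:
    """Hypothetical two-level Thompson sampling for Bernoulli bandits with clustered arms.

    Maintains Beta posteriors at both the cluster level and the arm level:
    each round, a cluster is chosen by sampling from the cluster posteriors,
    then an arm within that cluster is chosen by sampling from its arm posteriors.
    """

    def __init__(self, clusters):
        # clusters: list of lists, each inner list holding the arm indices of one cluster
        self.clusters = clusters
        n_arms = sum(len(c) for c in clusters)
        # Beta(1, 1) priors for every arm and every cluster
        self.arm_success = np.ones(n_arms)
        self.arm_failure = np.ones(n_arms)
        self.cluster_success = np.ones(len(clusters))
        self.cluster_failure = np.ones(len(clusters))

    def select_arm(self, rng):
        # Level 1: sample a score for each cluster and pick the highest
        cluster_samples = rng.beta(self.cluster_success, self.cluster_failure)
        k = int(np.argmax(cluster_samples))
        # Level 2: sample a score for each arm inside the chosen cluster
        arms = self.clusters[k]
        arm_samples = rng.beta(self.arm_success[arms], self.arm_failure[arms])
        return k, arms[int(np.argmax(arm_samples))]

    def update(self, k, arm, reward):
        # Update the arm-level and cluster-level posteriors with the binary reward
        self.arm_success[arm] += reward
        self.arm_failure[arm] += 1 - reward
        self.cluster_success[k] += reward
        self.cluster_failure[k] += 1 - reward


# Usage on a toy problem with two clusters of Bernoulli arms
rng = np.random.default_rng(0)
true_means = np.array([0.1, 0.15, 0.7, 0.75])  # second cluster is clearly better
bandit = TwoLevelThompsonSampling(clusters=[[0, 1], [2, 3]])
for _ in range(5000):
    k, arm = bandit.select_arm(rng)
    reward = float(rng.random() < true_means[arm])
    bandit.update(k, arm, reward)
```

The intuition behind exploiting the cluster structure is that once the cluster-level posteriors concentrate on the best cluster, exploration is largely confined to that cluster's arms, which can reduce both regret and per-round computation relative to flat Thompson sampling over all arms.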