In this paper we consider Thompson Sampling (TS) for combinatorial semi-bandits. We demonstrate that, perhaps surprisingly, TS is sub-optimal for this problem in the sense that its regret scales exponentially in the ambient dimension, and its minimax regret scales almost linearly. This phenomenon occurs under a wide variety of assumptions, including both non-linear and linear reward functions, with Bernoulli-distributed rewards and uniform priors. We also show that adding a fixed amount of forced exploration to TS does not alleviate the problem. We complement our theoretical results with numerical experiments, showing that in practice TS can indeed perform very poorly in some high-dimensional settings.
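To make the setting concrete, the following is a minimal sketch (not the paper's construction) of Thompson Sampling in a combinatorial semi-bandit with Bernoulli rewards, uniform Beta(1,1) priors, and a linear reward function, where the learner picks the top-m arms under its posterior sample and receives semi-bandit feedback on each chosen arm. All names and the top-m oracle are illustrative assumptions.

```python
import numpy as np

def thompson_semi_bandit(true_means, m, horizon, seed=None):
    """Illustrative TS for a combinatorial semi-bandit: Bernoulli rewards,
    uniform Beta(1,1) priors, linear reward, top-m oracle.
    Returns cumulative (pseudo-)regret over the horizon."""
    rng = np.random.default_rng(seed)
    d = len(true_means)
    alpha = np.ones(d)  # Beta posterior: 1 + observed successes per arm
    beta = np.ones(d)   # Beta posterior: 1 + observed failures per arm
    best_value = np.sort(true_means)[-m:].sum()  # value of the optimal super-arm
    regret = 0.0
    for _ in range(horizon):
        theta = rng.beta(alpha, beta)         # one posterior sample per base arm
        action = np.argsort(theta)[-m:]       # oracle: top-m sampled means
        # Semi-bandit feedback: observe a Bernoulli reward for each chosen arm.
        rewards = rng.random(m) < true_means[action]
        alpha[action] += rewards
        beta[action] += 1 - rewards
        regret += best_value - true_means[action].sum()
    return regret
```

The paper's negative results concern how such regret scales with the ambient dimension d; this sketch only fixes the mechanics of the sampler being analyzed.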